feat(hy_worldplay): HY-WorldPlay WAN-5B I2V integration #155
Adds integrations/hy_worldplay/ following the self_forcing /
causal_forcing mini-repo pattern. Phase-1 ships a vendor-wrapper
runner that delegates to upstream's wan/generate.py WanRunner so the
output is bit-for-bit identical to torchrun wan/generate.py with the
same flags. Promotion to a real flashdreams-run subcommand and the
HunyuanVideo-1.5 8B variant are tracked as phases 2-3 in the
integration README.
- hy_worldplay/{config.py, runner.py, cli.py}: tyro CLI exposed as
``python -m hy_worldplay.cli`` (and a ``hy-worldplay-wan-i2v-5b``
console-script entry).
- tests/test_smoke.py: 8 CPU-only checks for the runner config
surface (slug <-> runner_name, default pose invariant, missing-path
validation, CLI module imports without tyro/torch).
- tests/parity_check/: idempotent run.sh that clones upstream, pulls
``tencent/HY-WorldPlay`` checkpoints, and runs the reference
benchmark in an isolated venv.
- Top-level README + pyproject.toml: register the new path with ty /
pyright extra-paths and add a "Run HY-WorldPlay WAN-5B I2V" section
alongside the other inference recipes.
Brings the phase-1 vendor wrapper from e99d02f to numerical parity with upstream wan/generate.py (mean |Δ| 3.41/255, 0 frames over the visual threshold -- tighter than the torch 2.11->2.12 self-self drift of 3.76).

Two root causes for the residual drift:
1. The parity venv resolved torch==2.12 while the flashdreams main venv pins 2.11, so the two took divergent cuBLAS / bf16 reduction paths. Pinned `torch==2.11.*` in tests/parity_check/pyproject.toml in lockstep with flashdreams/uv.lock; bump together going forward.
2. DEFAULT_PROMPT in hy_worldplay/runner.py had a trailing `.` that upstream's argparse default does not. UMT5 tokenises the period as an extra token, shifting the conditioning embedding (~2 units of drift). Removed the period and added smoke-test parity guards (test_default_prompt_byte_matches_upstream et al.) so this cannot regress silently.

Also:
- pyproject.toml: add `[upstream]` optional-extra (accelerate, cloudpickle, filelock, pyyaml, remote-pdb, sageattention) needed when running `python -m hy_worldplay.cli` in-process from the main venv. Same deps added to parity_check/pyproject.toml.
- parity_check/run.sh + top-level README: fix huggingface-cli download -- positional args after the repo id are matched as exact filenames, not directory prefixes, so `wan_transformer wan_distilled_model` was silently fetching 0 files. Use `--include "wan_transformer/*" ...` instead.
- parity_check/README.md: replace stale `cmp` instructions with a numeric per-frame diff and document the accepted parity bar + caveats.
- Commit parity venv's uv.lock for reproducibility.
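The download fix comes down to exact-name matching versus glob matching. The sketch below illustrates the difference with Python's `fnmatch` as a stand-in filter; it is not the `huggingface-cli` implementation, and the repo listing is hypothetical apart from `wan_distilled_model/model.pt`.

```python
from fnmatch import fnmatch

# Hypothetical repo listing, for illustration only.
REPO_FILES = [
    "README.md",
    "wan_transformer/config.json",
    "wan_transformer/model.safetensors",
    "wan_distilled_model/model.pt",
]


def select_exact(files, names):
    """Keep only files whose full path equals one of the given names."""
    wanted = set(names)
    return [f for f in files if f in wanted]


def select_include(files, patterns):
    """Keep files that match at least one glob pattern."""
    return [f for f in files if any(fnmatch(f, p) for p in patterns)]


# Directory names passed as exact filenames match nothing.
exact = select_exact(REPO_FILES, ["wan_transformer", "wan_distilled_model"])
# Glob patterns pick up everything under both directories.
globbed = select_include(REPO_FILES, ["wan_transformer/*", "wan_distilled_model/*"])
print(len(exact), len(globbed))
```

With this listing the exact-name filter selects nothing while the glob filter selects the three checkpoint files, which mirrors the "silently fetching 0 files" symptom.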
… plan
Addresses @liruilong940607's PR review:
1. Stop adding HY-WorldPlay's heavy deps (sageattention, accelerate, cloudpickle, remote-pdb) to the repo-root ``uv.lock`` via the ``[upstream]`` optional-extra. Drop the extra. Add ``flashdreams-hy-worldplay`` as a path source to the parity sub-venv so that env now also serves as the plugin run venv -- matches the ``self_forcing/tests/parity_check`` layout he cited. Result: root ``uv.lock`` shrinks by 60 net lines; sageattention, accelerate, cloudpickle, remote-pdb all drop out; zero packages added. The heavy stack now lives only in ``integrations/hy_worldplay/tests/parity_check/uv.lock``.
2. Rewrite the staging plan to reflect his phase-2 direction. Today ``flashdreams/recipes/wan/`` ships Wan 2.1 (1.3B / 14B) and Wan 2.2 14B, but not Wan 2.2 5B -- which is HY-WorldPlay's actual backbone. New plan splits the work into 2a (add wan2.2-5B recipe to flashdreams/, useful in its own right) and 2b (layer HY-WorldPlay's action + trajectory + memory hooks on top of 2a, promote slug to ``flashdreams-run`` subcommand, collapse parity sub-venv back into the main flashdreams venv).
3. Update both READMEs to document the new 2-layer install (lightweight workspace member + isolated run/parity sub-venv) and the ``uv run --project <parity-check> ...`` invocation pattern.
4. Clear stale TODO in parity_check/README.md -- the prompt-byte-match test already landed in test_smoke.py last commit (test_default_prompt_byte_matches_upstream and its negative-prompt twin).

No behavioural change to the plugin runner; the parity bar from 6fb9321 is unchanged. 10/10 smoke tests still pass; parity sub-venv resolves cleanly with the new ``flashdreams-hy-worldplay`` workspace member.
Address PR #103 review (Ruilong): wire the plugin into `flashdreams-run`
instead of shipping a standalone `python -m hy_worldplay.cli` entry
point.
* `flashdreams/infra/runner.py`: make `RunnerConfig.pipeline` optional
(default `None`) and guard the seed-offset + pipeline-setup blocks in
`Runner.__init__`. Purely additive -- existing recipes always pass
`pipeline=...` explicitly so behavior is unchanged. Reserves a slot
for vendor-wrapper runners that don't yet have a flashdreams
`StreamInferencePipeline` to drive (phase-1 plugins).
* `integrations/hy_worldplay/`:
- `HyWorldPlayWanI2VRunnerConfig` now subclasses `RunnerConfig` with
`pipeline=None`; inherits `runner_name`, `description`,
`output_dir`, `device`, `offset_seed_by_global_rank` from the
base. HY-specific fields stay on the subclass.
- Drop `hy_worldplay/cli.py` + the `[project.scripts]` entry; add a
`[project.entry-points."flashdreams.runner_configs"]` entry so the
slug is discovered by the same plugin loader `wan21` /
`self_forcing` use.
- `HyWorldPlayWanI2VRunner` stays a plain class (not a `Runner`
subclass) because the phase-1 wrapper owns its own distributed
setup (deferred to upstream's `WanRunner`).
- Tests swap `test_cli_module_imports` for entry-point /
`pipeline is None` / `RunnerConfig`-subclass smokes.
- README, parity-check README + pyproject, run-docker.sh, repo-root
README: all replace `python -m hy_worldplay.cli` invocations with
`flashdreams-run hy-worldplay-wan-i2v-5b`.
* Add `agentic/skills/hy-worldplay-env-setup/SKILL.md`: end-to-end
setup decision tree (two-venv layout, HF auth, ~52 GB checkpoint
provisioning, run path, common errors, extension pitfalls,
phase-2 horizon). Matches the existing `agentic/skills/` layout.
* New `integrations/hy_worldplay/run-docker.sh`: convenience wrapper
that boots the flashdreams container, runs first-time provisioning,
and dispatches the runner via `flashdreams-run` (single- or
multi-GPU via `torchrun --no-python`).
The `flashdreams-run hy-worldplay-wan-i2v-5b` slug is the stable
user-facing interface and survives the phase-2 refactor when the
native WAN 2.2 5B recipe lands and the wrapper retires.
The previous commit made ``RunnerConfig.pipeline`` optional so phase-1
vendor wrappers like ``hy_worldplay`` can leave it ``None``. That
broke every site that assumed ``cfg.pipeline`` was non-``None`` and
left the new ``hy_worldplay`` tests without a CI tier marker, which
the ``marker_enforcement`` plugin rejects with pytest exit 4.
* ``integrations/hy_worldplay/tests/test_smoke.py``: add
``pytestmark = pytest.mark.ci_cpu`` so the new module clears the
CI-tier-marker enforcement check (was the root cause of both the
CPU and GPU CI ``Tests missing a CI tier marker`` failures). Also
pass the now-required ``runner_name="hy-worldplay-wan-i2v-5b"`` to
the two ``HyWorldPlayWanI2VRunnerConfig()`` constructors that used
the dataclass default before ``runner_name`` was inherited.
* ``{flashdreams,integrations/{alpadreams,causal_forcing,fastvideo_causal_wan22,lingbot,self_forcing,wan21}}/tests/test_*.py``:
guard the ``cfg.runner_name == cfg.pipeline.recipe_name`` drift
check with ``cfg.pipeline is not None``. The check is a CLI-contract
guard for runners that *have* a pipeline; phase-1 ``None``-pipeline
runners are out of scope and would otherwise short-circuit ty.
* ``integrations/alpadreams/alpadreams/runner.py``: assert
``self.config.pipeline is not None`` before ``.diffusion_model`` --
alpadreams always sets a pipeline, the assert just narrows the
optional for ty without changing runtime semantics.
* ``integrations/hy_worldplay/hy_worldplay/runner.py``: ``diffusers``'
``export_to_video`` is typed ``list[np.ndarray] | list[PIL.Image]``
but we were passing a single ``(T, H, W, 3)`` ndarray. Split into
a per-frame list with ``list(np.asarray(video[0]))`` so ty + the
runtime ``len()`` + index-access pattern in diffusers both match.
* ``.github/scripts/sync_version.py``: skip
``hy-worldplay-parity-check`` (independent versioning, mirrors the
existing ``self-forcing-parity-check`` skip) so the sync-version
hook does not bounce its ``0.0.0`` placeholder.
* ``integrations/hy_worldplay/pyproject.toml`` + ``uv.lock`` /
``tests/parity_check/uv.lock``: bump
``flashdreams-hy-worldplay`` to ``0.1.0a2`` to match the canonical
``flashdreams/_version.py``; drop the stale ``tyro`` resolved-dep
the parity-check lock had cached from the pre-entry-point CLI.
Verified locally with ``uvx --from "ty>=0.0.33" ty check`` (passes)
and ``python3 .github/scripts/sync_version.py`` (no diff).
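The drift-check guard described above can be sketched as follows. `PipelineConfig`, `RunnerConfig`, and `slug_matches_recipe` are simplified, hypothetical stand-ins rather than the flashdreams classes or test helpers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineConfig:  # hypothetical stand-in
    recipe_name: str


@dataclass
class RunnerConfig:  # hypothetical stand-in
    runner_name: str
    pipeline: Optional[PipelineConfig] = None


def slug_matches_recipe(cfg: RunnerConfig) -> bool:
    """CLI-contract check; only meaningful for runners that have a pipeline."""
    if cfg.pipeline is None:
        # Phase-1 vendor wrapper: no recipe name to compare against.
        return True
    return cfg.runner_name == cfg.pipeline.recipe_name


print(slug_matches_recipe(RunnerConfig("hy-worldplay-wan-i2v-5b")))
print(slug_matches_recipe(RunnerConfig("wan21", PipelineConfig("wan21"))))
```

The explicit `is None` branch also narrows the optional for a type checker, so the attribute access on the last line of the function is well-typed.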
Phase 2a deliverable for the hy_worldplay integration: an in-tree
Wan 2.2 TI2V 5B recipe that loads directly from the HF
``Wan-AI/Wan2.2-TI2V-5B-Diffusers`` checkpoint (VAE + DiT) and reuses
the existing ``WanInferencePipeline``.
- VAE: generalise ``WanVAE`` with configurable base_dim / z_dim /
patch_size / is_residual, add ``AvgDown3D`` / ``DupUp3D`` /
``ResidualDownBlock`` / ``ResidualUpBlock`` and patchify ops, plus
``Wan22TI2V5BVAE{Encoder,Decoder}Config`` and a diffusers
state-dict transform.
- DiT: add ``WanDiTNetworkTI2V5BConfig`` (3072d / 30L / 48ch /
no CLIP cross-attn) and per-token timestep support through
``Wan21Transformer.predict_flow`` -> ``WanDiTNetwork.forward``
-> ``Block`` / ``Head`` AdaLN modulation; ``stamp_image_latent``
plus the new ``ti2v_first_frame_per_token_timestep`` flag
implements the upstream "VAE-seeded first-frame + per-token t=0"
recipe at AR step 0 while keeping AR>=1 on the scalar CUDA-graph
shape.
- Pipeline: pre-rolled ``PIPELINE_WAN22_TI2V_5B`` config exported
alongside ``WAN_CONFIGS`` for slug-style consumers.
- Revert ``flashdreams/infra/runner.py``: ``RunnerConfig.pipeline`` is required again; drop the ``| None`` and the conditional ``None``-guards in ``Runner.__init__``.
- Add ``hy_worldplay/_vendor_pipeline.py`` with ``_NoopPipelineConfig`` and ``_NoopPipeline`` so the vendor wrapper satisfies the now-mandatory ``RunnerConfig.pipeline`` slot without instantiating a real flashdreams pipeline.
- Pin ``HyWorldPlayWanI2VRunnerConfig.pipeline`` to ``_NoopPipelineConfig`` and polish surrounding docstrings.
- Update tests/README to match (rename ``test_pipeline_is_none`` -> ``test_pipeline_is_vendor_wrapper_noop``).
- Delete ``agentic/skills/hy-worldplay-env-setup/`` (it was an integration plan, not a general agentic skill).
- README.md: announce that the phase-2a Wan 2.2 TI2V 5B recipe has landed in ``flashdreams.recipes.wan``.
Conflict resolutions:
- flashdreams/recipes/wan/autoencoder/vae.py: dropped the dead local _set_or_copy def; origin/main extracted it to flashdreams.infra.cuda_graph.set_or_copy and this branch already imports + uses that. Kept the _WAN21_LATENT_MEAN / _WAN21_LATENT_STD rename (with back-compat aliases).
- uv.lock: took origin/main and re-ran uv lock to pick up the hy_worldplay + diffusers/moviepy/proglog deps from this branch.
flashdreams main floors transformers>=5.0 (security PR #116) but upstream HY-WorldPlay's wan/ pipeline pins transformers==4.56.0 / 4.57.6 for parity reproducibility. The flashdreams code paths used in the parity venv (UMT5 + T5Tokenizer + CLIP) work identically on the patched 4.x line, so a scoped `tool.uv.override-dependencies` keeps the parity venv resolvable without weakening the repo-wide 5.x security floor.
Multi-PR decomposition for the native pipeline migration:
2b.1 native runner driving PIPELINE_WAN22_TI2V_5B, I2V base only
(feature-flagged behind --use-native-pipeline; vendor wrapper
stays default).
2b.2 FlowMatchEulerDiscreteSchedulerConfig with the distilled 4-step
hardcoded timestep schedule.
2b.3 action conditioner (81-class discrete -> time-embed AdaLN add).
2b.4 camera-trajectory conditioner (PRoPE dual-branch attention).
2b.5 memory module + KV-prefill hook; drop the parity sub-venv,
re-run parity, flip --use-native-pipeline to default.
This commit only ships the design doc. Implementation lands in
follow-up sub-PRs, starting with 2b.1 in the same session as this
spec.
…ne (phase 2b.1)
Add an opt-in HyWorldPlayWanI2VNativeRunner that drives PIPELINE_WAN22_TI2V_5B end-to-end instead of upstream's WanRunner, selected by HyWorldPlayWanI2VRunnerConfig.use_native_pipeline. This is the first slice of the phase-2b migration laid out in docs/superpowers/specs/2026-05-20-hy-worldplay-phase-2b-design.md; action / camera-trajectory / memory conditioning and the scheduler swap follow in sub-PRs 2b.2-2b.5.

Routing lives in HyWorldPlayWanI2VRunnerConfig.__post_init__: when the flag is set, it swaps ``_target`` to the native runner and replaces the inert _NoopPipelineConfig with a deepcopy of PIPELINE_WAN22_TI2V_5B (deepcopy keeps per-rank seed offsets and derive_config mutations isolated from the module-level singleton). The vendor-wrapper path is completely untouched and stays the default so the phase-1 parity bar is preserved.

The native runner is implemented as a Runner subclass in a separate _native_runner.py so the existing CPU smoke tests still load runner.py without pulling in torch / the diffusers stack: __post_init__ lazy-imports the native runner only when the flag is set.

Tests: three new CPU tests cover the routing swap, deepcopy isolation across config instances, and respect for a user-supplied pipeline= override. A new test_native_smoke.py adds a ci_gpu-marked end-to-end test that skips cleanly without CUDA or the HY_WORLDPLAY_FIXTURE_IMAGE fixture; it does NOT assert numeric parity (2b.1 misses conditioners + the scheduler swap, so output will not match the vendor wrapper baseline -- parity returns at 2b.5).

README gains a "Native pipeline (preview)" section under "Run" with the --use-native-pipeline example and the 2b.1-2b.5 incremental rollout note; the staging-plan section is updated to mark 2b.1 as landed and list the remaining sub-PRs.
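The routing and deepcopy isolation can be sketched with plain dataclasses. Every class here is a simplified, hypothetical stand-in, and the sketch uses `None` to detect a user-supplied pipeline, which may differ from how the real config detects an override.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineConfig:  # hypothetical stand-in
    name: str
    seed: int = 0


@dataclass
class NoopPipelineConfig(PipelineConfig):  # plays the role of _NoopPipelineConfig
    name: str = "noop"


# Module-level singleton, playing the role of PIPELINE_WAN22_TI2V_5B.
PIPELINE_SINGLETON = PipelineConfig(name="wan22-ti2v-5b")


@dataclass
class RunnerConfig:  # hypothetical stand-in
    use_native_pipeline: bool = False
    pipeline: Optional[PipelineConfig] = None
    target: str = "vendor_wrapper"

    def __post_init__(self):
        user_supplied = self.pipeline is not None
        if self.use_native_pipeline:
            self.target = "native_runner"
            if not user_supplied:
                # Each config gets its own copy, so mutations stay local.
                self.pipeline = copy.deepcopy(PIPELINE_SINGLETON)
        elif not user_supplied:
            self.pipeline = NoopPipelineConfig()


a = RunnerConfig(use_native_pipeline=True)
b = RunnerConfig(use_native_pipeline=True)
a.pipeline.seed = 7  # e.g. a per-rank seed offset
print(b.pipeline.seed, PIPELINE_SINGLETON.seed)
```

Without the deepcopy, both configs would share one object and the seed written through `a` would leak into `b` and into the module-level recipe.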
…lled swap (phase 2b.2)
Add a first-order explicit Euler solver for flow-matching ODEs to
flashdreams.infra.diffusion.scheduler, mirroring diffusers'
FlowMatchEulerDiscreteScheduler default behaviour and exposing an
optional ``fixed_timesteps`` knob so distilled few-step checkpoints
can pin an externally-derived schedule instead of round-tripping
through ``set_timesteps``.
In HyWorldPlayWanI2VRunnerConfig.__post_init__, when
``use_native_pipeline=True``, swap the deep-copied
PIPELINE_WAN22_TI2V_5B's FlowMatchUniPCSchedulerConfig (40 step) for
FlowMatchEulerDiscreteSchedulerConfig(
num_inference_steps=4,
fixed_timesteps=(1000.0, 960.0, 888.8889, 727.2728, 0.0),
), matching upstream HY-WorldPlay's
wan/inference/pipeline_wan_w_mem_relative_rope.py ``few_step=True``
branch verbatim. The base PIPELINE_WAN22_TI2V_5B recipe keeps UniPC
so non-HY callers of the recipe (any future flashdreams integration
that wants the Wan 2.2 5B backbone at the non-distilled 40-step
setting) aren't perturbed.
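As a rough illustration of the solver, here is a first-order Euler loop over the pinned schedule. It assumes the common flow-matching convention `sigma = t / 1000`, `x_t = (1 - sigma) * x0 + sigma * noise`, and a model that predicts the velocity `noise - x0`; the function and argument names are illustrative, not the flashdreams scheduler API.

```python
FIXED_TIMESTEPS = (1000.0, 960.0, 888.8889, 727.2728, 0.0)


def euler_flow_sample(x, predict_velocity, timesteps, num_inference_steps,
                      num_train_timesteps=1000.0):
    """Integrate dx/dsigma = v with explicit Euler along a fixed schedule."""
    if len(timesteps) != num_inference_steps + 1:
        raise ValueError("fixed_timesteps needs num_inference_steps + 1 entries")
    sigmas = [t / num_train_timesteps for t in timesteps]
    # zip stops after num_inference_steps (sigma, sigma_next) pairs.
    for t, sigma, sigma_next in zip(timesteps, sigmas, sigmas[1:]):
        v = predict_velocity(x, t)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy "identity flow": the predictor always returns the true velocity.
x0 = [0.5, -1.0]
noise = [2.0, 3.0]
calls = []


def true_velocity(x, t):
    calls.append(t)
    return [n - c for n, c in zip(noise, x0)]


sample = euler_flow_sample(list(noise), true_velocity, FIXED_TIMESTEPS, 4)
print(sample, len(calls))
```

With a constant true velocity the sigma steps telescope to -1, so the sample collapses back to `x0` (up to float rounding) after exactly four predictor calls, regardless of how unevenly the schedule is spaced.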
Tests:
- flashdreams/tests/test_scheduler_fm_euler.py: 7 unit tests covering
fixed_timesteps round-trip, length-mismatch failure, derived
linspace+warp schedule, identity-flow sample collapse, predictor
call count, add_noise nearest-timestep snap, bf16 cast
preservation.
- integrations/hy_worldplay/tests/test_smoke.py: new
test_use_native_pipeline_swaps_scheduler_to_euler_distilled pins
the class swap + the exact 5-entry timestep schedule.
After 2b.2 the native runner shares upstream's scheduler exactly; any
remaining drift vs the vendor wrapper baseline is conditioner-only
(action / camera-trajectory / memory still come from default zeros).
Spec and README updated to mark 2b.2 landed.
…ng (phase 2b.3)
Ports HY-WorldPlay's 81-class action conditioner onto the native pipeline: HyWorldPlayWanDiTNetwork subclass adds a zero-init action_embedding MLP summed into the time embedding before AdaLN modulation, with a matching encoder + Wan21Transformer subclass that slice per-AR-step labels and thread them through network_extra_kwargs. The residual head ships zero-initialised so the conditioner is a strict identity until HY-WorldPlay's distilled checkpoint is layered on top in 2b.5.
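Why a zero-initialised head gives a strict identity can be shown with a tiny pure-Python MLP. The layer sizes, the SiLU activation, and the one-hot action input are assumptions for illustration; only the 81-class count and the zero-init final layer come from the commit message.

```python
import math


def linear(x, weight, bias):
    """y = W x + b for a weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def silu(x):
    return [xi / (1.0 + math.exp(-xi)) for xi in x]


def action_residual(one_hot, w1, b1, w2, b2):
    """Two-layer MLP: linear -> SiLU -> linear (activation is an assumption)."""
    return linear(silu(linear(one_hot, w1, b1)), w2, b2)


NUM_ACTIONS, HIDDEN, EMBED = 81, 4, 3
one_hot = [1.0 if i == 17 else 0.0 for i in range(NUM_ACTIONS)]
w1 = [[0.01 * (r + c) for c in range(NUM_ACTIONS)] for r in range(HIDDEN)]
b1 = [0.1] * HIDDEN
w2 = [[0.0] * HIDDEN for _ in range(EMBED)]  # zero-initialised head
b2 = [0.0] * EMBED

time_embed = [0.25, -1.5, 3.0]
residual = action_residual(one_hot, w1, b1, w2, b2)
conditioned = [t + r for t, r in zip(time_embed, residual)]
print(conditioned == time_embed)
```

Whatever the first layer produces, the zero head maps it to a zero residual, so the time embedding passes through unchanged until a checkpoint supplies non-zero head weights.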
…integration
# Conflicts:
# README.md
# integrations/omnidreams/tests/test_recipe_configs.py
…camera-conditioning (phase 2b.4)
Port the PRoPE projective positional encoding to flashdreams core and ship the dual-branch RoPE + PRoPE self-attention as a HY-WorldPlay subclass, so each transformer block runs the stock RoPE attention branch plus a parallel branch that applies per-frame camera-projective transforms (P = lift(K) @ viewmats) to Q / K / V before attention. Gated behind ``--use-camera-conditioning``; composes with ``--use-action-conditioning`` via the shared encoder / transformer / network subclass tree.

The new ``o_prope`` linear is zero-initialised so the PRoPE branch contributes exactly zero residual until HY-WorldPlay's distilled checkpoint loads non-zero weights for it -- strict identity vs the base recipe until then, same parity-safe pattern as 2b.3.

Includes a numpy-reference parity check for the math port and a small precision fix on the lift-K helper (allocate with input dtype so fp64 callers don't get a silent fp32 downcast on the assignment).

CP > 1 is intentionally gated off in both the action and PRoPE branches; the multi-rank wiring lands together with reconstituted-context memory in 2b.5.
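A small worked example of `P = lift(K) @ viewmats` for a single camera follows. It assumes that "lift" embeds the 3x3 intrinsics in the top-left of a 4x4 identity, which is a common construction but is not confirmed by this PR text; the helper names and camera values are illustrative.

```python
def lift_intrinsics(K):
    """Embed a 3x3 intrinsics matrix in the top-left of a 4x4 identity (assumed)."""
    lifted = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
    for r in range(3):
        for c in range(3):
            lifted[r][c] = float(K[r][c])
    return lifted


def matmul4(A, B):
    """Plain 4x4 matrix product."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def projective_transform(K, viewmat):
    return matmul4(lift_intrinsics(K), viewmat)


K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
# World-to-camera matrix with identity rotation and translation (1, 2, 4).
viewmat = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, 0.0, 2.0],
           [0.0, 0.0, 1.0, 4.0],
           [0.0, 0.0, 0.0, 1.0]]

P = projective_transform(K, viewmat)
print([row[3] for row in P])
```

The last column works out to `500*1 + 320*4 = 1780`, `500*2 + 240*4 = 1960`, `4`, and `1`, i.e. the translation expressed in pixel-scaled camera coordinates.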
…se-memory-selection (phase 2b.5a)
Port HY-WorldPlay's per-AR-step memory-frame *selection policy* to flashdreams: `hy_worldplay/_memory.py` ships a 1:1 port of upstream's `select_mem_frames_wan` + the supporting FOV-overlap helper `calculate_fov_overlap_similarity`, with the same `temporal_context_size + FOV-budget = memory_frames` invariant and the same loud-failure when the budget can't be filled.

Plumbed end-to-end through the existing 2b.3 / 2b.4 conditioner tree: `HyWorldPlayCtrl` gains a `memory_frame_indices` list field (preserved through patchify), `HyWorldPlayWanCtrlEncoder.set_memory_config` / `clear_memory_config` arm the encoder with the Monte-Carlo sphere + selection knobs, and `_compute_memory_indices` runs the selector lazily inside `forward` against the bound viewmats history. Below the `context_window_length` threshold the encoder emits `None` to mirror upstream's "elif use_memory" branch.

Gated behind `--use-memory-selection`; requires `--use-camera-conditioning` because the selector consumes the per-rollout viewmats binding (enforced in `__post_init__`). The native runner builds the sphere on the pipeline device once via `_bind_memory_config`. Defaults off because the FOV-overlap sweep is the dominant per-AR cost and there is no consumer of the indices yet -- noise prediction is unchanged whether the flag is set or not.

The KV-prefill *executor* (transformer pre-pass with `is_cache=True` on the selected frames) is deferred to 2b.5b together with the `flashdreams.core.attention.kvcache.BlockKVCache` arbitrary-position write extension it requires (upstream's cache is positionally indexed by frame, flashdreams' cache is sink + rolling window; bridging the two is the architectural change blocking the prefill hook) and the HY-WorldPlay distilled-weight remap that drives the parity flip. Splitting keeps the policy port reviewable in isolation. README + spec doc updated to reflect the 2b.5a / 2b.5b split.

CPU-only smoke tests cover the algorithm (sortedness, dedup, recent-frames invariant, budget-underfill assertion, identity-pose overlap = 1.0), encoder shape-validation and history-gating, ctrl patchify round-trip, and runner-config wiring.
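The budget invariant and the loud failure can be illustrated with a simplified selector. This is not a port of upstream's `select_mem_frames_wan`: the function name, its signature, and the precomputed `overlap_scores` list (standing in for the FOV-overlap similarity) are all hypothetical.

```python
def select_memory_frames(current_idx, overlap_scores, temporal_context_size,
                         memory_frames):
    """Pick the most recent frames plus the best-overlapping older frames.

    overlap_scores[i] is a hypothetical similarity between frame i and the
    current view; higher means more shared field of view.
    """
    fov_budget = memory_frames - temporal_context_size
    if fov_budget < 0:
        raise ValueError("memory_frames must be >= temporal_context_size")
    recent_start = max(0, current_idx - temporal_context_size)
    recent = list(range(recent_start, current_idx))
    # Rank everything older than the recent window by overlap, best first.
    older = sorted(range(recent_start), key=lambda i: overlap_scores[i],
                   reverse=True)
    picked = older[:fov_budget]
    if len(picked) < fov_budget:
        raise AssertionError("not enough history to fill the FOV budget")
    # Sorted and de-duplicated, recent frames always included.
    return sorted(set(recent) | set(picked))


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4, 0.0, 0.0, 0.0, 0.0]
selected = select_memory_frames(current_idx=10, overlap_scores=scores,
                                temporal_context_size=4, memory_frames=6)
print(selected)
```

Here the four most recent frames (6 to 9) are always kept and the remaining budget of two goes to frames 1 and 3, the older frames with the highest scores.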
…rt1)
Adds `hy_worldplay_distilled_state_dict_transform` so upstream's
`wan_distilled_model/model.pt` loads strict=True into the
`HyWorldPlayWanDiTNetwork` parameter tree built by 2b.3 + 2b.4. The
transform unwraps the `.pt` envelope (`generator` / `generator_ema`
subkey + `model.` / `_fsdp_wrapped_module.` prefix stripping),
composes with `wan22_ti2v_5b_dit_state_dict_transform` for the base
5B trunk, and adds three HY-specific rewrites that map
`condition_embedder.action_embedder.linear_{1,2}.*` ->
`action_embedding.{0,2}.*` and
`blocks.{i}.attn1.to_out_prope.0.*` ->
`blocks.{i}.self_attn.o_prope.*`. The runner config's
`__post_init__` auto-routes `checkpoint_path` +
`state_dict_transform` to the new pair whenever `--ckpt-path` is
supplied alongside any conditioner flag, so existing CLI / smoke
tests stay backward-compatible. Verified end-to-end: building a
30-block / 3072-dim `HyWorldPlayWanDiTNetwork(use_prope_blocks=True)`
yields 889 parameters, the remap of the real distilled checkpoint
produces exactly those 889 keys with matching shapes, and
`load_state_dict(strict=True)` succeeds with 0 missing / 0
unexpected. Action MLP `linear_2` and every block's `o_prope`
move from zero-init to non-zero norms, so the action + camera
conditioners now contribute real residuals.
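The key rewrites listed above can be sketched with a few regular expressions. This covers only the envelope unwrapping and the three HY-specific rewrites; the base-trunk transform is omitted, and the preference for `generator_ema` over `generator` is an assumption made for the example.

```python
import re

PREFIXES = ("model.", "_fsdp_wrapped_module.")
REWRITES = [
    (re.compile(r"^condition_embedder\.action_embedder\.linear_1\."),
     "action_embedding.0."),
    (re.compile(r"^condition_embedder\.action_embedder\.linear_2\."),
     "action_embedding.2."),
    (re.compile(r"^blocks\.(\d+)\.attn1\.to_out_prope\.0\."),
     r"blocks.\1.self_attn.o_prope."),
]


def remap_key(key):
    """Strip wrapper prefixes, then apply the HY-specific rewrites."""
    stripped = True
    while stripped:  # prefixes may be nested in either order
        stripped = False
        for prefix in PREFIXES:
            if key.startswith(prefix):
                key = key[len(prefix):]
                stripped = True
    for pattern, repl in REWRITES:
        key = pattern.sub(repl, key)
    return key


def remap_state_dict(checkpoint):
    """Unwrap the .pt envelope (subkey choice is an assumption) and remap keys."""
    state = checkpoint.get("generator_ema") or checkpoint.get("generator") or checkpoint
    return {remap_key(k): v for k, v in state.items()}


demo = {"generator": {
    "model._fsdp_wrapped_module.blocks.3.attn1.to_out_prope.0.weight": "w",
    "model.condition_embedder.action_embedder.linear_2.bias": "b",
}}
print(sorted(remap_state_dict(demo)))
```

A strict load is a useful check on a remap like this: any key the rewrites miss shows up immediately as a missing or unexpected entry.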
The KV-prefill executor and the per-block memory KV cache layer
it needs are deferred to 2b.5b-part2 (separate sub-PR): they
require a per-rollout clean-latent history buffer, a flat memory
cache distinct from `BlockKVCache`'s rolling window, and a RoPE
position-collapse remap to mirror upstream's
`current_start * 880`/`current_end * 880` token-offset prefill
convention. README + design spec are updated to reflect the
2b.5b-part1 / 2b.5b-part2 split. Memory selection plumbing from
2b.5a remains live and is unchanged: indices are still emitted on
`HyWorldPlayCtrl.memory_frame_indices` for the future executor to
consume.
….5b-part2)
Wire the reconstituted-context KV-prefill machinery end-to-end on the HY native path. All three coupled architectural pieces from the 2b.5b-part2 design land together; CPU tests pin the structural invariants. Numerical parity is gated on the per-rollout viewmats / Ks / action threading that lands in 2b.5b-part2-followup.

* HyWorldPlayMemoryKVCache (_camera.py) -- per-block flat cache with separate rope / prope branch slots for the prefilled K / V at upstream's RoPE-collapsed positions [0, K). reset / write_rope / write_prope / has_*_kv predicates.
* HyWorldPlayPRoPEBlockCache (_camera.py) -- gains a `memory` slot alongside self_attn / prope_self_attn, plus reset_current_chunk() that wipes only the rolling caches (memory has its own reset cycle owned by the prefill executor).
* HyWorldPlayPRoPESelfAttention.forward_dual_branch -- accepts an optional memory_kv_cache and prepends its K / V to both branches' sequence dim before the attention call, mirroring upstream's cat([cache, current], dim=-2). Strict no-op short-circuit on the empty-cache path keeps chunk 0 bit-identical to the 2b.4 baseline.
* HyWorldPlayPRoPESelfAttention.prefill_memory_kv (new) + HyWorldPlayPRoPEBlock.prefill_memory_kv (new) -- side-effect-only calls that compute Q/K/V + apply RoPE / PRoPE transforms and write into the memory cache. Cross-attn / FFN / residual stream / output projection are all skipped.
* HyWorldPlayWan21TransformerCache (_action.py) -- new Wan21TransformerCache subclass with clean_latent_history, finished_chunks, hy_chunk_size_t, hy_tokens_per_frame. Its start() override resets per-block rolling caches at every chunk past the first and pre-pokes _prev_chunk_idx so the inherited before_update(autoregressive_index) accepts the synthetic "next chunk" transition.
* HyWorldPlayWanDiTNetwork.prefill_memory_kv_cache (new) -- mirrors forward()'s patchify + time / action embedding + AdaLN modulation pre-amble and loops over blocks calling prefill_memory_kv instead of block.forward().
* HyWorldPlayWan21Transformer.prefill_memory_kv_cache (new) + predict_flow gate + finalize_kv_cache override. The driver slices cache.clean_latent_history at the per-frame token ranges, builds RoPE freqs for the collapsed [0, K) positions via the rope adapter's _freq_components primitive, resets each block's memory slot, and dispatches to the network-level prefill on each active branch (cond + uncond). The predict_flow gate uses _n_cached == 0 to detect "first denoising step of the chunk" so the prefill runs exactly once per chunk. finalize_kv_cache appends the patchified clean latent (detached) to the history and skips the parent's predict_flow re-run.
* initialize_autoregressive_cache override returns the HY cache subclass and stamps the per-rollout tokens-per-frame for the prefill driver to read.

Per-rollout viewmats / Ks / action streams are still per-AR-step on the ctrl as of this release; _slice_per_frame falls back to a [:K] truncation flagged with TODO(2b.5b-part2-followup) and pinned by test_slice_per_frame_handles_action_and_matrices. The followup also covers GPU smoke + parity diff + sub-venv removal + default flag flip.

CPU tests (17 new in test_prefill.py, 93 total in HY suite):
* memory cache surface (defaults, write/read, reset, has_*)
* block cache memory slot + reset_current_chunk skips memory
* prefill_memory_kv writes both branches, doesn't touch rolling caches, fails on viewmats=None
* dual-branch attention short-circuits empty memory cache
* transformer cache history defaults + start() reset semantics (chunk 0 untouched, chunk > 0 wipes rolling caches)
* _append_clean_latent_to_history concat + detach
* _slice_per_frame dispatch by rank / dtype
* _is_first_step_of_chunk gating
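The memory-cache surface and the empty-cache short-circuit can be sketched with plain lists standing in for tensors along the sequence axis. The class below is a toy, not `HyWorldPlayMemoryKVCache`; in particular the predicate names `has_rope_kv` / `has_prope_kv` are guesses at what `has_*_kv` expands to, and `with_memory` is a hypothetical helper.

```python
class MemoryKVCache:
    """Toy flat cache with separate rope / prope slots."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.rope_kv = None
        self.prope_kv = None

    def write_rope(self, k, v):
        self.rope_kv = (list(k), list(v))

    def write_prope(self, k, v):
        self.prope_kv = (list(k), list(v))

    def has_rope_kv(self):
        return self.rope_kv is not None

    def has_prope_kv(self):
        return self.prope_kv is not None


def with_memory(cached_kv, k, v):
    """Prepend cached K / V along the sequence axis; no-op when the slot is empty."""
    if cached_kv is None:
        return k, v  # same objects back, so the no-memory path is untouched
    mem_k, mem_v = cached_kv
    return mem_k + k, mem_v + v


cache = MemoryKVCache()
k_cur, v_cur = ["k4", "k5"], ["v4", "v5"]
k0, v0 = with_memory(cache.rope_kv, k_cur, v_cur)  # chunk 0: empty cache
cache.write_rope(["k0", "k1"], ["v0", "v1"])       # prefill before chunk 1
k1, v1 = with_memory(cache.rope_kv, k_cur, v_cur)
print(k0, k1)
```

Returning the inputs unchanged on the empty path is what lets the first chunk behave exactly as it did before the memory slot existed.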
… 2b.5b-part2-followup)
Lands the per-rollout viewmats / Ks / action plumbing that the
2b.5b-part2 prefill executor needed to slice in rollout coordinates
rather than the per-AR-step (chunk-truncated) coordinates -- the
parity-incorrect ``_slice_per_frame`` stub from the structural
skeleton is replaced with ``_index_rollout_buffer`` calling
``tensor.index_select(axis, memory_frame_indices)``. Validates the
result with a 2-chunk GPU smoke on RTX 6000 Pro at 256x448 with the
distilled checkpoint, which also surfaced four bugs the CPU tests
couldn't catch:
* ``wan22_ti2v_5b_vae_state_dict_transform`` was missing the per-field
remap for ``mid_block.resnets.{0,1}``; without it 12 VAE params per
side stayed on ``meta`` and ``.to(device)`` crashed. Base recipe
fix that benefits all Wan22 5B native callers.
* ``_native_runner._bind_camera_data`` now unsqueezes a batch axis on
viewmats / Ks so ``prope_qkv`` sees its required
``[batch=1, cameras, 4, 4]`` rank.
* ``_compute_memory_indices`` casts the bound viewmats to fp32 before
``.cpu().numpy()``; numpy has no bf16 ABI.
* ``_native_runner.run`` casts the preprocessed first-frame tensor
to the pipeline dtype so the residual VAE's first ``CausalConv3d``
doesn't fail the conv-input dtype check.
CPU tests grow from 91 to 99 (4 new ``_index_rollout_buffer`` /
encoder rollout-attach tests). The README + design spec are updated
with the GPU smoke status, the four drive-by fixes, and two known
quirks observed during validation (prefill fires once per denoising
step rather than once per chunk; upstream FOV selector has boundary
issues on short rollouts). Three followup items still pending:
end-to-end parity diff at production resolution, parity sub-venv
removal, ``--use-native-pipeline`` default flip.
…(phase 2b.5b-part2-followup parity attempt)
This commit set out to land phase 2b.5b-part2-followup items (3)-(5):
(3) end-to-end parity diff vs the phase-1 vendor-wrapper baseline
(4) parity sub-venv removal
(5) `--use-native-pipeline` default flip
Standing up the parity harness surfaced one real config bug and one
deeper algorithmic divergence:
* **Config bug (fixed here).** `_swap_in_action_conditioning_configs`
was inheriting the base recipe's `len_t=21` / `window_size_t=21`
directly into the `HyWorldPlayWan21TransformerConfig`, but
upstream's autoregressive WAN-5B uses `pred_latent_size=4` per AR
step (see `wan/inference/helper.py`'s `CHUNK_SIZE=4`). Without an
override the native path produced 21-latent chunks while the vendor
produced 4-latent chunks -- different total frame counts, different
RoPE positions, different memory-selection cadence. The swap now
forces `len_t=4` / `window_size_t=4` and
`test_use_action_conditioning_swaps_encoder_and_transformer` was
tightened to pin both values (the previous assertion let `len_t=21`
through, which is what hid this through 2b.3 / 2b.4 / 2b.5a / 2b.5b).
* **Algorithmic divergence (open, blocking cleanup).** With matching
frame counts (vendor `pose=w-8 num_chunk=2` and native `pose=w-7
num_chunk=2` both produce 29-frame mp4s with byte-identical
motion-integrated trajectories) the diff still reports `mean |Δ| =
110.7 / 255` and `PSNR = 5.81 dB` at 704x1280 -- far outside the `5
/ 255` parity bar. Native frame 0 sits at `mean rgb = [148.7,
137.1, 144.6]` while the input image and vendor frame 0 both sit
at `~[106, 117, 103]`, i.e. the conditioning frame is not
reconstructing through the HY swap path even though
`stamp_image_latent=True` survives the swap and a pre-HY native
rollout reproduces the input image perfectly. Ruled out so far via
focused probes:
- `torch.compile` / CUDA graph (disabling both reproduces the
same delta);
- checkpoint loading (`load_state_dict(strict=True)` on the
uncompiled network reports 0 missing / 0 unexpected keys,
sampled distilled weights including `o_prope` and
`action_embedding` have realistic stats);
- pose-trajectory math (vendor's
`hyvideo.generate.generate_camera_trajectory_local` and
flashdreams' `_pose._generate_trajectory_c2w` both prepend an
identity pose and use the same yaw/pitch/forward/right
integration);
- input-image preprocessing (vendor's `resize_and_center_crop`
and native's `preprocess_first_frame` are byte-equivalent for
the test image at 704x1280);
- `len_t` semantics (now matched by this commit).
Suspected root cause is somewhere in
`HyWorldPlayWan21Transformer.predict_flow` or the dual-branch
PRoPE attention rewrite silently breaking the base recipe's I2V
mask / clean-latent stamping / first-frame-per-token timestep
masking. This is documented as a new follow-on **phase 2b.6** in
the design spec, with the same parity bar as the gate.
Cleanup items (4) and (5) stay deferred under 2b.6 because the
sub-venv is still needed to iterate against the vendor baseline and
we cannot ship a broken native path as the default. The parity diff
harness itself (vendor `wan/generate.py` invocation + `imageio[FFMPEG]`
per-frame uint8 RGB delta) is documented in the README and reusable
as-is by 2b.6.
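The per-frame metrics quoted in this thread can be computed as in the sketch below, which works on frames given as flat sequences of uint8 values. Decoding the mp4s is left out, the function names are illustrative, and applying the 5/255 bar per frame is an assumption about how the harness counts "frames over the threshold".

```python
import math


def mean_abs_delta(frame_a, frame_b):
    """Mean |Δ| between two frames of equal size (values in 0..255)."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of values")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def psnr(frame_a, frame_b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def frames_over_bar(video_a, video_b, bar=5.0):
    """Indices of frames whose mean |Δ| exceeds the parity bar."""
    return [i for i, (fa, fb) in enumerate(zip(video_a, video_b))
            if mean_abs_delta(fa, fb) > bar]


vendor = [[10, 20, 30], [0, 0, 0]]
native = [[13, 20, 24], [10, 10, 10]]
print(mean_abs_delta(vendor[0], native[0]), frames_over_bar(vendor, native))
```

In this toy pair the first frame differs by a mean of 3.0 and passes, while the second differs by 10.0 and is flagged.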
CPU tests: 99 passed, 1 skipped.
…f (phase 2b.6 partial)

The 2b.5b-part2-followup parity attempt reported `mean |Δ| = 110.7 / 255` against the phase-1 vendor wrapper at 704x1280 / `num_chunk=2`. Fixes for three discrete bugs land in this commit and drop that to `mean |Δ| = 61.4 / 255`, with chunk-0 (frames 0-12) now sitting at `mean |Δ| ~ 7-20 / 255` -- close to phase-1's documented 3.41/255 vendor-vs-vendor torch-version drift. The remaining `~60 / 255` is architectural (chunk-1 cache-prefill vs single-forward-pass mismatch with vendor) and tracked as 2b.6.1.

1. _native_runner._write_mp4: was handing `diffusers.utils.export_to_video` `uint8 [0, 255]` frames. The helper interprets `np.ndarray` frames as `float [0, 1]` and internally does `(frame * 255).astype(np.uint8)` -- the multiply overflowed and frame 0's mean RGB came out `[148, 136, 146]` instead of the input image's `[107, 118, 104]`, which is the symptom that originally appeared as "I2V conditioning divergence". Now passing `float32 [0, 1]`.

2. _action.HyWorldPlayWanCtrlEncoder._compute_memory_indices: the HY override of `Wan21Transformer.finalize_kv_cache` skips the base rolling-KV update and `HyWorldPlayWan21TransformerCache.start` resets the rolling cache at every chunk boundary, so the prefill executor is the *only* path that lights up cross-chunk attention on the HY native runner. The selector was returning `None` whenever `current_frame_idx < context_window_length`, silently dropping vendor's `elif use_memory: list(range(0, current_frame_idx))` fall-back -- the net result was chunk-1+ attending to nothing from previous chunks. Now matches vendor's branch: AR step > 0 always emits memory indices when camera data is bound (FOV-selected past the warm-up window, all-history otherwise). The encoder's `_compute_memory_indices_*` CPU tests are tightened to pin the new semantics; the disarmed-encoder-returns-None case stays observable via a new `_no_camera_returns_none` test.

3. _action.HyWorldPlayWan21Transformer.prefill_memory_kv_cache: was forwarding the noisy denoising timestep `t_now` to AdaLN when computing memory K / V from the clean chunk-0 latents. Vendor uses `stabilization_level - 1 = 14` for these positions (see pipeline_wan_w_mem_relative_rope.py lines 883-887 / 908-913). Added `_HY_STABILIZATION_TIMESTEP = 14` and the driver now builds a fresh `context_timestep = torch.full_like(timestep, fill_value=14)` so the memory positions get the correct clean-context modulation while the main forward still uses `t_now` for the chunk-1 noisy positions.

Tests:
- 99 HY-WorldPlay CPU tests pass (no regressions).
- New `test_encoder_compute_memory_indices_no_camera_returns_none` covers the unbound-viewmats fall-back.
- `test_encoder_compute_memory_indices_gates_on_history` / `..._disabled_uses_all_history` rewritten to pin the all-history fall-back for AR step > 0 with bound camera data.
- 704x1280 / `num_chunk=2` / `seed=0` GPU parity diff: `mean |Δ| = 110.7 → 61.4 / 255` (parity bar: 5/255; chunk-1 architectural gap covers the remaining ~60).

Cleanup deferred:
- Parity sub-venv removal and `--use-native-pipeline` default flip stay deferred until 2b.6.1 closes the chunk-1 cache-prefill vs single-forward-pass mismatch with vendor's `use_kv_cache=False` baseline (see the design spec for the two refactor options).

README and design spec updated to document the partial close and the remaining 2b.6.1 follow-on.
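The `_write_mp4` overflow from item 1 can be reproduced without `diffusers`. The sketch below is a minimal illustration, assuming only that the export helper applies `(frame * 255).astype(np.uint8)` to `np.ndarray` frames as described above; `scale_like_export_helper` and `to_unit_float` are hypothetical names, not functions from the repo or from `diffusers`.

```python
import numpy as np


def scale_like_export_helper(frames: np.ndarray) -> np.ndarray:
    """Mimic the scaling step described above: input is assumed to be float [0, 1]."""
    return (frames * 255).astype(np.uint8)


def to_unit_float(frames_u8: np.ndarray) -> np.ndarray:
    """Convert uint8 [0, 255] frames to float32 [0, 1] before handing them to the helper."""
    return frames_u8.astype(np.float32) / 255.0


# Mean RGB of the input image quoted in the commit message, used as one pixel.
pixel = np.array([107, 118, 104], dtype=np.uint8)

# uint8 * 255 stays uint8 and wraps modulo 256, so each value v becomes (256 - v) % 256.
wrapped = scale_like_export_helper(pixel)

# Going through float32 [0, 1] first keeps the values (up to float rounding/truncation).
restored = scale_like_export_helper(to_unit_float(pixel))
```

Here `wrapped` lands far from the input colour, while `restored` stays within one code value of it because `astype` truncates rather than rounds.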
…use_kv_cache=True) Updates the README phase list and the phase-2b design spec to reflect the chosen close path for phase 2b.6 after the three real-bug fixes landed in bf8a4ff. The remaining chunk-1 gap (~60/255 on top of the post-bf8a4ff `mean |Δ| 61.4 / 255` baseline) is an architectural mismatch between native (cache-prefill + chunk-1-only forward) and vendor's parity default (`use_kv_cache=False`, single forward over all 9 latents). Option C closes 2b.6 by re-baselining vendor with `use_kv_cache=True` -- the cache-prefill code path the native runner already mirrors, shipped by upstream as tested-but-not-default. Option A (refactor native to single-forward-pass) is deferred to 2b.6.1 and only undertaken if C cannot close the gap. README changes: - Native pipeline (preview) list: trims the 2b.5b-part2-followup parity-attempt entry (its "open algorithmic divergence" framing is obsolete now), adds concrete 2b.6 (partially landed) and 2b.6.1 (future; not currently planned) entries with the three fixed bugs + remaining options. - Staging plan list: splits the old monolithic 2b.5b sub-bullet into five sub-bullets that match the actual state (2b.5b-part1 landed, 2b.5b-part2 landed, 2b.6 partially landed, 2b.6.1 not yet started), with the long-deferred cleanup (sub-venv removal + default flip) now attached to 2b.6.1 (the actual gating phase) instead of 2b.5b. Design spec changes: - Sub-PR table: 2b.6 row updated to "in progress; close path = Option C" with the validation + cleanup scope; 2b.6.1 row rewritten as the Option A refactor (future; not currently planned). - Success criteria table: 2b.6 entry updated with the Option C close path and the acceptance bar (≤5/255 against the `use_kv_cache=True` baseline). 
- New "Sub-PR 2b.6 design (this session)" section: covers why Option C over A, files to touch (concrete: new `run_vendor_use_kv_cache.py` helper + `run.sh` flag), the phase-1 (parity validation) / phase-2 (cleanup) split, tests, failure-mode contingencies, and out-of-scope items. No code changes; implementation lands in subsequent commits per the forthcoming plan.
7-task plan covering:

Phase 1 (parity validation, gates Phase 2):
- T1: runtime monkey-patch helper (run_vendor_use_kv_cache.py) + CPU tests for the __setattr__ coercion via a WanPipeline stand-in
- T2: USE_KV_CACHE_TRUE=1 env-var branch in parity_check/run.sh + parity_check README update
- T3: GPU steps -- regenerate vendor baseline (use_kv_cache=True), regenerate native baseline, diff, decision gate (4-row table: hold/chunk-1-only/chunk-0-regress/vendor-broken)

Phase 2 (cleanup, gated on T3 holding ≤5/255):
- T4: flip use_native_pipeline=True default + update tests
- T5: drop sub-venv heavy deps (sageattention/cloudpickle/accelerate/transformers==4.57.6) + main-venv GPU smoke
- T6: README + design spec updates marking 2b.6 closed
- T7: optional removal of vendor-wrapper runner (defaults to KEEP unless no consumer remains)

Plan follows the writing-plans skill convention: exact file paths, TDD-style steps per task (write failing test, run to verify fail, implement, run to verify pass, commit), no narrative placeholders (only <FILL IN> for the runtime-determined parity number that Task 3 Step 4 produces and Task 3+ commits record).
Adds the runtime monkey-patch infrastructure that lets us re-baseline the vendor parity reference against vendor's cache-prefill code path (use_kv_cache=True) -- the architecture the native HY-WorldPlay runner already mirrors. A parity diff against this re-baselined vendor MP4 is the phase 2b.6 acceptance gate (Option C in the design spec). Pieces: - integrations/hy_worldplay/tests/parity_check/run_vendor_use_kv_cache.py: factory `make_use_kv_cache_true_subclass(base)` returns a subclass whose `__setattr__` coerces any `use_kv_cache` assignment to True. `_patch_and_run` injects vendor's import paths, rebinds the `wan.inference.pipeline_wan_w_mem_relative_rope.WanPipeline` symbol to the patched subclass BEFORE vendor's `generate.py` resolves its `from ... import WanPipeline`, then delegates to `runpy.run_path(... / "wan/generate.py", run_name="__main__")` so vendor's argparse / WanRunner / torchrun wiring all pass through unchanged. - integrations/hy_worldplay/tests/test_parity_helper.py (new, CPU): four `ci_cpu` tests against a tiny `WanPipeline` stand-in pin the subclass factory's behaviour: (1) coercion of `use_kv_cache=False` to True inside `predict()`, (2) other attributes pass through untouched, (3) idempotent double-wrap, (4) generated class name embeds the base class name for debuggable tracebacks. The real vendor `WanPipeline` is not imported here -- it would require the HY-WorldPlay tree + heavy parity sub-venv deps -- so the test works from the main flashdreams venv. - integrations/hy_worldplay/tests/parity_check/conftest.py (new): `collect_ignore_glob = ["HY-WorldPlay/**", ".venv/**"]` so pytest doesn't try to collect the vendor tree's internal `test_*.py` files (which import vendor-internal deps like `gsplat` only available in the parity sub-venv). Without this guard, `pytest integrations/hy_worldplay/tests/` fails at collection. Plan deviation: the test was originally specced at `tests/parity_check/test_run_vendor_use_kv_cache.py`. 
Moved it to `tests/test_parity_helper.py` because pytest discovery recurses into the parity_check directory; placing the test there would have meant collecting it alongside the vendor tree which has broken imports. The helper script stays in `parity_check/` since that's where the parity infra lives. Tests: 103 passed (99 existing + 4 new), 2 skipped (no regressions). The `_patch_and_run` GPU path is exercised by run.sh in the next task (T2).
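The subclass factory described above can be sketched in a few lines. This is an illustrative reconstruction from the commit message, not the code in `run_vendor_use_kv_cache.py`: the marker attribute `_forces_use_kv_cache`, the generated class-name suffix, and the `StandInPipeline` class are assumptions made for the example, and the module-symbol rebinding plus `runpy.run_path` delegation are left out.

```python
def make_use_kv_cache_true_subclass(base):
    """Return a subclass of `base` whose instances always store use_kv_cache=True."""
    # Idempotent double-wrap: an already patched class is returned unchanged.
    if getattr(base, "_forces_use_kv_cache", False):
        return base

    class _Forced(base):
        _forces_use_kv_cache = True  # hypothetical marker attribute

        def __setattr__(self, name, value):
            # Coerce any assignment to use_kv_cache; pass everything else through.
            if name == "use_kv_cache":
                value = True
            super().__setattr__(name, value)

    # Embed the base class name so tracebacks stay debuggable.
    _Forced.__name__ = f"{base.__name__}UseKvCacheTrue"
    _Forced.__qualname__ = _Forced.__name__
    return _Forced


class StandInPipeline:
    """Tiny stand-in for the vendor pipeline (assumption: predict() sets the flag)."""

    def predict(self):
        self.use_kv_cache = False  # coerced to True on the patched subclass
        self.num_steps = 4         # unrelated attributes are untouched
        return self.use_kv_cache


Patched = make_use_kv_cache_true_subclass(StandInPipeline)
pipeline = Patched()
flag_after_predict = pipeline.predict()
```

Overriding `__setattr__` rather than patching `predict()` means the coercion applies wherever the vendor code assigns the flag, which is presumably why the helper takes that route.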
… (phase 2b.6 T2) Adds the env-var-gated branch that swaps the default `wan/generate.py` invocation for the T1 helper (`run_vendor_use_kv_cache.py`). The mode is opt-in: default behaviour (no env var) is unchanged, so the existing `use_kv_cache=False` baseline still reproduces phase-1's parity numbers byte-for-byte. When `USE_KV_CACHE_TRUE=1` is set, run.sh routes torchrun to the helper which: 1. Subclasses `WanPipeline` with `__setattr__` coercing `use_kv_cache=True`, 2. Rebinds the module-level WanPipeline symbol BEFORE vendor's `from ... import WanPipeline` resolves, 3. Delegates to `wan/generate.py`'s `if __name__ == "__main__":` block via `runpy.run_path`, preserving sys.argv so vendor's argparse CLI surface passes through unchanged. Updates parity_check README with: (a) the new `USE_KV_CACHE_TRUE` env-var tunable in the existing table, (b) a "Re-baselining against vendor's use_kv_cache=True code path" section documenting the phase 2b.6 acceptance baseline + example invocation + cross-reference to the phase-2b design spec. Tests: 103 passed (no regressions). The GPU validation arrives in T3.
The Option C parity check (vendor re-baselined with use_kv_cache=True via the runtime monkey-patch shipped in commits a7e7673 + b769e7b) ran end-to-end and disproved the architectural-mismatch hypothesis that was driving the chunk-1 close path:

- vendor (use_kv_cache=False) <-> vendor (use_kv_cache=True): mean |Δ| = 3.24 / 255 (PASS the 5/255 bar)
- native <-> vendor (use_kv_cache=True): mean |Δ| = 65.05 / 255 (FAIL; chunk 0 16.92, chunk 1 104.77, chunk 2 101.47 with a G+B color cast at the chunk-0 → chunk-1 boundary)

The two vendor modes are functionally equivalent, so the residual gap between native and either vendor mode is a native-side implementation bug in the chunk-1+ cache-prefill or its post-prefill cross-chunk attention -- not architecture. Static review of the prefill driver, per-block writers, RoPE collapse, AdaLN modulation, per-rollout buffer indexing, and dual-branch concat did not surface an obvious defect; the diagnosis loop now requires runtime tensor dumps at matched native / vendor call sites.

This commit:
* Updates the design spec sub-PR table to mark 2b.6 as "Option C check done; cleanup deferred to 2b.6.2" and carves out a new 2b.6.2 entry with a ranked diagnosis runway (timestep, RoPE collapse, rolling-cache reset, index_select dtype, vendor's hardcoded patches_x / patches_y for PRoPE) for the implementation bug + a 2b.6.1 entry downgraded from "next" to "conditional escape hatch only".
* Rewrites the "Sub-PR 2b.6 design (this session)" section to record the actual outcome (Option C check landed, hypothesis disproved, cleanup punted) instead of the pre-execution plan.
* Updates the integration README's "Native pipeline (preview)" prose so the residual divergence is correctly described as a native-side implementation bug, not an architectural gap, and points readers at the USE_KV_CACHE_TRUE=1 reproducer for the re-baseline.

No code changes; runtime behaviour is unchanged. All 99 HY-WorldPlay CPU tests continue to pass.
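For readers unfamiliar with the `mean |Δ|` figures, the metric is a per-frame uint8 RGB delta averaged over frames or chunks. The sketch below shows one way to compute it with NumPy, assuming decoded videos shaped `[T, H, W, 3]`; the function names and the `chunk_edges` argument are hypothetical, and the repo's actual diff script may differ.

```python
import numpy as np


def per_frame_mean_abs_delta(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Mean absolute uint8 RGB delta per frame for two videos shaped [T, H, W, 3]."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # Widen before subtracting so uint8 differences do not wrap around.
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))
    return diff.reshape(diff.shape[0], -1).mean(axis=1)


def per_chunk_mean(per_frame: np.ndarray, chunk_edges: list[int]) -> list[float]:
    """Average per-frame deltas over the [start, end) frame ranges in chunk_edges."""
    return [
        float(per_frame[start:end].mean())
        for start, end in zip(chunk_edges[:-1], chunk_edges[1:])
    ]


# Synthetic 4-frame example: the second half of `candidate` is offset by 10.
reference = np.zeros((4, 2, 2, 3), dtype=np.uint8)
candidate = reference.copy()
candidate[2:] = 10

frame_deltas = per_frame_mean_abs_delta(reference, candidate)
chunk_deltas = per_chunk_mean(frame_deltas, [0, 2, 4])
overall = float(frame_deltas.mean())
```

Splitting by chunk is what makes a pattern such as "chunk 0 close, chunk 1+ far off" visible; a single overall mean would hide it.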
…-1 diagnosis

The 2b.6 Option C run confirmed that vendor's use_kv_cache=True and use_kv_cache=False modes are functionally equivalent (mean |Δ| = 3.24 / 255, inside the 5/255 bar), which leaves the native HY-WorldPlay chunk-1+ divergence (mean |Δ| = 65.05 / 255 against either vendor baseline) as a real native-side implementation bug rather than an architectural mismatch. Static review of the prefill driver / per-block prefill writers / dual-branch attention concat did not surface a defect, so we add an env-var-gated runtime dump harness and instrument the matched call sites so the chunk-0 vs chunk-1 / native vs vendor diff can be carried out from real tensor stats next iteration.

What this commit adds (no functional / numerical change unless the env var is set):
* ``_debug_dump.py``: per-call-site tensor-stat dumper with thread-safe JSONL output, a CUDA-graph-capture safe guard (skip dumps while capturing, otherwise file I/O + ``.item()`` would invalidate the capture), and a context stack so chunk / step / block / branch tags flow through the dump records.
* ``_action.py``: dumps at ``predict_flow.entry`` (records timestep shape + transformer config knobs), at the chunk-1+ ``prefill_memory_kv_cache`` entry (records memory_x / rope_freqs / context_timestep / per-rollout viewmats / Ks / action), and per-block ``phase=prefill`` / ``block_idx`` context for the prefill loop. The forward dump context is set on the parent ``forward`` call so the per-block self-attention dumps in ``_camera.py`` carry chunk + step + block tags.
* ``_camera.py``: dumps at ``HyWorldPlayPRoPESelfAttention``'s ``prefill_memory_kv`` (raw K / V, rope_freqs, post-RoPE / post-PRoPE K and V being written into the memory cache) and at ``forward_dual_branch`` (raw Q / K / V, rope_freqs, the pre-memory-concat and post-memory-concat cached K / V for both branches). Lets the diff localise to either the prefill writer or the forward attention's memory-prepend concatenation.
* ``_native_runner.py``: ``HY_DEBUG_DISABLE_CUDA_GRAPH=1`` env-var toggle that rebinds ``transformer._network_call`` / ``_network_call_uncond`` to the eager ``network`` so the per-network ``CUDAGraphWrapper`` doesn't fight ``_debug_dump``'s host-synchronous calls. Required because the default WAN-5B pipeline captures the network forward and that capture region can't tolerate dump-induced sync points. The harness is default-disabled so production / parity runs pay zero overhead. Enable with ``HY_DEBUG_DUMP=/path/to/dump.jsonl`` (and pair with ``HY_DEBUG_DISABLE_CUDA_GRAPH=1`` to actually let the dumps fire). This is the Phase 1a deliverable from the 2b.6.2 implementation plan (``docs/superpowers/plans/2026-05-22-hy-worldplay-phase-2b6-close.md``). The remaining 2b.6.2 phases (vendor-side dump patch + matched-config capture + diff + root-cause + fix + parity verify + flip default) land in follow-ups; the diagnostic infrastructure is committed first so it isn't lost between debug iterations.
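The shape of such a harness is easy to sketch. The example below is a simplified stand-in, not the contents of ``_debug_dump.py``: it keeps the ``HY_DEBUG_DUMP`` env-var gate, the locked JSONL append, and a context stack for tags, but it works on NumPy arrays, uses a process-wide (not per-thread) context stack, and omits the CUDA-graph-capture guard. The names ``dump_context`` and ``dump_stats`` are hypothetical.

```python
import json
import os
import tempfile
import threading
from contextlib import contextmanager

import numpy as np

_LOCK = threading.Lock()
_CONTEXT: list = []  # stack of tag dicts; process-wide in this sketch


@contextmanager
def dump_context(**tags):
    """Push tags (chunk, step, block_idx, ...) for every dump made inside the block."""
    _CONTEXT.append(tags)
    try:
        yield
    finally:
        _CONTEXT.pop()


def dump_stats(site: str, array: np.ndarray) -> bool:
    """Append one JSONL record of tensor stats if HY_DEBUG_DUMP is set; else do nothing."""
    path = os.environ.get("HY_DEBUG_DUMP")
    if not path:
        return False  # default-disabled: no I/O, no stats computed
    record = {"site": site}
    for tags in _CONTEXT:
        record.update(tags)
    record.update(
        shape=list(array.shape),
        abs_mean=float(np.abs(array).mean()),
        min=float(array.min()),
        max=float(array.max()),
    )
    with _LOCK:  # serialise writers so JSONL lines never interleave
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
    return True


# Demo: one tagged dump with the env var set, one skipped dump without it.
demo_path = os.path.join(tempfile.mkdtemp(), "dump.jsonl")
os.environ["HY_DEBUG_DUMP"] = demo_path
with dump_context(chunk=1, phase="prefill", block_idx=0):
    dump_stats("prefill_memory_kv", np.ones((2, 3), dtype=np.float32))
del os.environ["HY_DEBUG_DUMP"]
skipped = dump_stats("predict_flow.entry", np.zeros(4))

with open(demo_path, encoding="utf-8") as fh:
    records = [json.loads(line) for line in fh]
```

Recording summary statistics rather than full tensors keeps the JSONL small enough to diff native against vendor by call site and tag.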
…CFG + vendor-aligned RNG) The Phase 2b.6.2 dump harness landed in f3efa41 captured per-block tensor stats for chunk-0 and chunk-1 in both native and vendor (use_kv_cache=True baseline). Diffing the dumps surfaced two independent bugs in the native path that together account for ~20/255 of the residual 65/255 parity gap; both are fixed here, taking the overall mean |Δ| from 65.05/255 down to 46.23/255. The remaining ~46/255 still localises entirely to chunk-1+ (chunk-0 is 7-15/255, chunk-1+ is 50-80/255) and is now a pure implementation-bug class -- the per-token noisy_latent on the chunk-1 predict_flow entry matches vendor bit-for-bit (see the noise alignment verification below), so the divergence happens inside the transformer forward (memory KV-prefill, dual-branch attention concat, or AdaLN modulation for the chunk-1+ time/action embedding combination). That last fix lands in a follow-up commit alongside the closing parity number. Bug 1: ``guidance_scale`` mismatch (CFG combine double-applied) * Symptom: chunk-1 frames show a strong G+B colour cast, abs_mean divergence ~83/255 even after the dump diff confirmed the chunk-1 noisy_latent abs_mean matched vendor to ~0.1% (0.7972 vs 0.7989). * Root cause: the base ``PIPELINE_WAN22_TI2V_5B`` recipe ships with ``guidance_scale=5.0`` because the non-distilled WAN-5B model needs explicit Classifier-Free Guidance. HY-WorldPlay's distilled WAN-5B checkpoint bakes the guidance into its weights -- vendor's upstream ``wan/inference/pipeline_wan_w_mem_relative_rope.py`` calls ``current_model`` exactly once per scheduler step in the ``few_step=True`` branch, regardless of ``do_classifier_free_guidance``. The native swap was inheriting ``guidance_scale=5.0`` from the base recipe, so ``Wan21Transformer.predict_flow`` was running an extra uncond forward and doing ``flow_uncond + 5 * (flow_cond - flow_uncond)`` on top of the already-distilled noise prediction -- effectively applying CFG twice. 
* Fix: pin ``guidance_scale=1.0`` on the :class:`HyWorldPlayWan21TransformerConfig` constructed inside :meth:`HyWorldPlayWanI2VRunnerConfig._swap_in_action_conditioning_configs` so the predict_flow path drops the uncond branch (and its dedicated network cache slot) and matches vendor's single-pass output. The base ``PIPELINE_WAN22_TI2V_5B`` stays at 5.0 so non-HY callers that drive the non-distilled WAN-5B keep their CFG. * Test update: ``test_use_action_conditioning_swaps_encoder_and_transformer`` in tests/test_smoke.py was previously asserting that the swap inherits ``guidance_scale=5.0``; flipped to 1.0 and added a comment explaining the distilled-checkpoint contract. Bug 2: RNG stream mismatch (private gen seed=42 vs vendor global seed=0) * Symptom: chunk-1 noisy_latent **sample values** diverge bit-for-bit between native and vendor even though their overall stats match (native ``[0.139, -0.108, -0.719, 0.758, ...]`` vs vendor ``[0.875, 0.965, -0.132, -1.602, ...]``). * Root cause: vendor's ``generate.py`` calls ``torch.manual_seed(input_dict["seed"])`` (seed=0) at the top of ``predict`` and then draws all of ``num_latent_frames``'s noise in a single ``randn([1, 48, T, H_lat, W_lat])`` inside ``prepare_latents`` -- per-chunk noise is just a ``[..., ar*len_t:(ar+1)*len_t, ...]`` slice. Native, in contrast, uses ``DiffusionModelConfig.seed=42`` to build a private ``torch.Generator(device).manual_seed(42)`` and draws ``randn(self.latent_shape, generator=self.rng)`` per chunk -- a completely independent RNG stream. Even ignoring the seed value the stride patterns differ: vendor's chunk-1 noise lives at flat positions ``T*H*W`` apart (where T = 241 latent frames in the big tensor), but native's chunk-1 noise lives at flat positions ``len_t*H*W = 4*44*80`` apart from chunk-0, so the per-channel slices never line up. * Fix: add an ``HY_VENDOR_NOISE_MODE=1`` env-var-gated toggle on :class:`HyWorldPlayWanI2VNativeRunner` that mirrors vendor's noise flow bit-for-bit. 
When set, the runner calls ``torch.manual_seed(cfg.seed)`` once, draws ``randn([1, in_dim, num_latent_frames, H_lat, W_lat])`` in fp32 on the pipeline device (matching vendor's ``prepare_latents`` shape exactly via ``num_latent_frames = (num_frames - 1) // 4 + 1``), and patchifies each ``ar*len_t:(ar+1)*len_t`` slice through the same ``... (t kt) c (h kh) (w kw) -> ... (t h w) (c kt kh kw)`` rearrange the network's conv3d patch embedding applies internally. A monkey-patch on ``torch.randn`` for the duration of the chunk loop swaps in the pre-computed slice whenever the request shape matches the diffusion model's ``latent_shape``; all other randn calls fall through. Verified on the live RTX PRO 6000 setup that the resulting per-chunk noise tensor matches vendor's chunk-1 dump bit-for-bit (first 8 bf16 values ``[0.875, 0.96484375, -0.1318359375, -1.6015625, 0.38671875, 0.8984375, 0.361328125, 0.1787109375]`` match exactly). * Diagnostic-only: the env var stays default-disabled so production runs keep using native's private-generator stream (which the rest of the codebase still depends on for the per-rank seed-offset contract). Once 2b.6.2 closes we'll either (a) keep this as a parity-only toggle or (b) flip the default once we're confident no consumer relies on the seed=42 stream. Additional housekeeping * Land the vendor-side dump harness (``tests/parity_check/dump_patch.py`` + ``run_vendor_use_kv_cache_dump.py``) that was the matched-call-site half of the f3efa41 instrumentation; it monkey-patches vendor's ``CausalCameraPRopeWanAttnProcessor2_0`` + ``WanTransformer3DModel`` to write the same JSONL records the native ``_debug_dump`` produces. * 104 CPU tests pass (1 skipped, the GPU-only smoke).
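The stream mismatch behind Bug 2 can be illustrated without a GPU. The sketch below uses NumPy's `Generator` as a stand-in for the torch RNG (so the values will not match either runner), with toy shapes and hypothetical function names; it only shows that slicing one large draw along T and drawing each chunk separately yield different chunk-1 noise even with the same seed. The latent-frame formula is the one quoted in the commit message.

```python
import numpy as np


def num_latent_frames(num_frames: int) -> int:
    """Latent frame count, per the formula quoted in the commit message."""
    return (num_frames - 1) // 4 + 1


def single_draw_chunks(seed, channels, total_t, h, w, len_t):
    """Vendor-style flow: one draw over all latent frames, then slice along T."""
    rng = np.random.default_rng(seed)
    full = rng.standard_normal((1, channels, total_t, h, w), dtype=np.float32)
    return [
        full[:, :, ar * len_t:(ar + 1) * len_t]
        for ar in range(total_t // len_t)
    ]


def per_chunk_chunks(seed, channels, total_t, h, w, len_t):
    """Native-style flow: a fresh draw of the chunk shape for every chunk."""
    rng = np.random.default_rng(seed)
    return [
        rng.standard_normal((1, channels, len_t, h, w), dtype=np.float32)
        for _ in range(total_t // len_t)
    ]


# Toy sizes (assumption): 2 channels, 8 latent frames, 2x2 latents, 4 frames per chunk.
vendor_like = single_draw_chunks(0, channels=2, total_t=8, h=2, w=2, len_t=4)
native_like = per_chunk_chunks(0, channels=2, total_t=8, h=2, w=2, len_t=4)

same_shape = vendor_like[1].shape == native_like[1].shape
# Same seed and same shape, but the chunk-1 values come from different stream positions.
same_values = np.array_equal(vendor_like[1], native_like[1])
```

This is why aligning the seed alone would not be enough: the parity toggle also has to reproduce the single full-length draw and slice it per chunk.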
Could use a lint fix |
Per Ruilong's "could use a lint fix". Ran the repo's pre-commit ``ruff-format`` + ``ruff-fix`` (``--fix --select I`` import-sort) hooks against every Python file touched by this branch. 20 files reformatted, 22 import-order issues auto-fixed. No semantic changes (line wrapping + import grouping only).
Done in 108ee5f — ran ruff format + ruff check --fix --select I (the two repo-configured pre-commit |
/ok to test 105679a |
The cmd to fix link [I suggest it to save to local folder maybe |
Earlier ``108ee5f`` only ran the scoped ``uvx ruff format`` +
``uvx ruff check --fix --select I``. Manager pointed at the canonical
flow in ``#83#issuecomment-4474887618``:
uv sync --extra dev --group lint --no-install-package transformer-engine-torch --no-install-package ludus-renderer
uv run --no-sync pre-commit run -a
Ran ``uvx pre-commit run -a`` directly (skipping the partial ``uv sync``
that fails on this box without CUDA_HOME). Picked up 21 additional
import-order fixes + 2 ``ruff format`` reformats across files this PR
already touches on the transformer / network / pose / camera /
runner / test side. No semantic changes; line-wrapping and import
grouping only.
…/wenqingw-nv/flashdreams into wenqing/hy-worldplay-integration
/ok to test af917ba |
Will do — opening a tracking issue for the follow-ups (.pth VAE swap, pose JSON default, vendor-side EventProfiler already landed, model card page) right after this lands. Thanks for the thorough review! |
CI still failing -- seems like there are many linting complaints -- could you try to fix those linting issues [try not to skip them if they are fixable]
…ocal/ Per PR #155 review (Ruilong): the runner cached upstream's sample first-frame under ``assets/example_data/hy_worldplay/`` (under the tracked ``assets/`` tree) and the README quickstart cmd referenced ``./assets/img/test.png`` which doesn't exist locally. Switch the cache root to the gitignored ``data_local/hy_worldplay/`` and rewrite the README cmd to use ``--example-data`` so the shipped example works end-to-end without a broken path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #155 CI ``cpu`` job failed on the ``ty`` hook. Fix every ty diagnostic without suppressing fixable ones (per review): - Narrow ``TransformerConfig`` -> ``Wan21TransformerConfig`` and ``EncoderConfig`` -> ``WanI2VCtrlEncoderConfig`` at use sites with ``isinstance`` asserts (config.py, runner.py, test_smoke.py). - Narrow ``nn.Module``'s ``Tensor | Module`` ``self.network`` to the HY-DiT network before the memory-prefill call (_action.py). - Narrow ``LayerNorm | Identity`` ``norm3`` and the memory cache's ``Tensor | None`` K/V before ``torch.cat`` (_camera.py). - Widen the distilled-state-dict transform's param to ``dict[str, Any]`` so it accepts both the raw envelope and the pre-stripped dict (_checkpoint.py). - ``Image.LANCZOS`` -> ``Image.Resampling.LANCZOS`` (Pillow 10+), type the vendor-noise ctx as ``AbstractContextManager`` and the patched-randn args as ``Any`` (runner.py). - Pass required ``pipeline=`` and add ``mask=`` to ``HyWorldPlayCtrl`` ctors in tests; assert non-None where ty can't narrow. - ty-exclude ``integrations/hy_worldplay/tests/parity_check/**`` (bench/parity scaffolding, not shipped product) mirroring the existing ``omnidreams/interactive_drive`` exclude. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
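The ``isinstance`` / non-None narrowing pattern used for most of these fixes looks roughly like the sketch below. The two config classes are minimal stand-ins that borrow names from the commit message, and ``read_guidance_scale`` / ``concat_memory`` are hypothetical helpers, not functions from the repo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TransformerConfig:
    """Stand-in for the broad base config type."""
    hidden: int = 8


@dataclass
class Wan21TransformerConfig(TransformerConfig):
    """Stand-in for the narrower config that actually carries guidance_scale."""
    guidance_scale: float = 5.0


def read_guidance_scale(cfg: TransformerConfig) -> float:
    # The assert narrows cfg for the type checker and fails fast at runtime
    # if a different config subclass is passed in.
    assert isinstance(cfg, Wan21TransformerConfig), type(cfg)
    return cfg.guidance_scale


def concat_memory(memory_k: Optional[List[float]], new_k: List[float]) -> List[float]:
    # Narrow Optional[...] to a concrete list before concatenating.
    assert memory_k is not None, "memory cache must be prefilled first"
    return memory_k + new_k


scale = read_guidance_scale(Wan21TransformerConfig())
merged = concat_memory([0.1, 0.2], [0.3])
```

Compared with a blanket ignore comment, the assert documents the expected type at the use site and surfaces a wrong config as an immediate ``AssertionError``.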
…/wenqingw-nv/flashdreams into wenqing/hy-worldplay-integration
…y-integration # Conflicts: # uv.lock
The linter check errors are fixed — head is now 52b3b22. Could you please drop an /ok to test so cpu/gpu run? @liruilong940607 |
/ok to test 1ba9020 |
@wenqingw-nv i think you should also be able to trigger that : ) |
`cpu` job (pre-commit run -a) failed after main merged 0.1.0a4: - sync-version: bump the two new integrations (hy_worldplay, wan22) to 0.1.0a4 to match flashdreams; sync uv.lock. - ty unused-ignore: drop now-redundant `# ty: ignore` on the predict_flow restore (test_action) and the block() call (test_camera). - ty invalid-argument-type: `forward_dual_branch(rope_freqs=None)` wants a Tensor; cast to Any (CP gate raises before rope_freqs is read, so runtime is unchanged). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/ok to test 75897ff |
Summary
Adds the HY-WorldPlay WAN-5B I2V integration as a new `integrations/hy_worldplay/` plugin: Tencent Hunyuan's real-time interactive world model (streaming video diffusion with action + camera-trajectory conditioning and reconstituted-context memory). The native path is the production default; the vendor wrapper stays available via `--no-use-native-pipeline` for a byte-for-byte match against upstream's `torchrun wan/generate.py`.

Phase breakdown:

- `uv` workspace member; `HyWorldPlayWanI2VRunner` shim over upstream's `wan/generate.py` `WanRunner.predict()` for bit-identical output to `torchrun wan/generate.py`; registered with `flashdreams-run` via the `flashdreams.runner_configs` entry-point group; parity-check harness under `tests/parity_check/`.
- … `flashdreams.recipes.wan`). Fills the gap between Wan 2.1 (1.3B / 14B) and Wan 2.2 14B: `Wan22TI2V5BVAE{Encoder,Decoder}Config` (16× spatial, 48-ch latent, residual + outer-patchify=2), `WanDiTNetworkTI2V5BConfig` (3072d / 30-layer, no CLIP cross-attn), `Wan21TransformerConfig.ti2v_first_frame_per_token_timestep` flag (AR-0 per-token / AR≥1 scalar dispatch), `PIPELINE_WAN22_TI2V_5B` pre-rolled config, and diffusers safetensors remaps. Useful independent of HY-WorldPlay.
- `PIPELINE_WAN22_TI2V_5B` + distilled 4-step Euler schedule (`FlowMatchEulerDiscreteSchedulerConfig`).
- … `HyWorldPlayWanDiTNetwork`).
- … `HyWorldPlayPRoPEBlock`; `prope_qkv` math in `flashdreams.core.attention.prope`).
- … `select_mem_frames_wan` + FOV-overlap helper, ported to `hy_worldplay/_memory.py`).
- … `hy_worldplay_distilled_state_dict_transform`) + KV-prefill executor (per-rollout `clean_latent_history`, per-block `HyWorldPlayMemoryKVCache`, per-chunk rolling-cache reset, `prefill_completed_for_chunk` latch).
- `mean |Δ| = 15.65 / 255` (704×1280, `num_chunk=2`, `seed=0`, against vendor's `use_kv_cache=True` baseline): below the visible threshold (~30/255) and about 5× the vendor-vs-vendor kernel noise floor (3.24/255). Acceptance bar `<= 20 / 255`.