Minimal single-GPU Wan video-DiT inference with TeaCache / First-Block-Cache step-skipping, extracted from xDiT with all distributed and sequence-parallel machinery removed.
Highlights:
- apply_cache_on_transformer hooks a diffusers WanTransformer3DModel via a block-stack wrapper (TeaCache uses e0/e signal; FBCache uses the first-block residual).
- NanoWanPipeline: explicit, instrumentable denoising loop driving per-CFG-branch caches.
- Verified on Wan2.1-T2V-1.3B (480x832, 33f, 30 steps): 1.57x @ thr=0.1 (37% skip), 2.26x @ thr=0.2 (57% skip). Forced-compute is bit-exact vs no-cache.
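The First-Block-Cache skip decision can be illustrated with a toy sketch. This is not the nanoxdit implementation: `FirstBlockCache`, `rel_l1`, and the `force_compute` flag are hypothetical names chosen for illustration, plain Python lists stand in for tensors, and the relative-L1 test against the last computed step is an assumption about how the threshold is applied. It keeps one cache per CFG branch, as the pipeline description suggests.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Vec = List[float]  # stand-in for a hidden-state tensor


def rel_l1(cur: Vec, prev: Vec) -> float:
    """Summed absolute change, relative to the summed magnitude of `prev`."""
    num = sum(abs(c - p) for c, p in zip(cur, prev))
    den = sum(abs(p) for p in prev)
    return num / den if den > 0 else float("inf")


@dataclass
class FirstBlockCache:
    """Hypothetical per-branch cache state (illustrative, not the nanoxdit API)."""
    threshold: float
    prev_first_residual: Optional[Vec] = None   # first-block residual at last computed step
    cached_rest_residual: Optional[Vec] = None  # residual added by the remaining blocks
    skipped: int = 0
    computed: int = 0

    def forward(
        self,
        x: Vec,
        first_block: Callable[[Vec], Vec],
        rest_blocks: Callable[[Vec], Vec],
        force_compute: bool = False,
    ) -> Vec:
        # The first block always runs; its residual is the cheap change signal.
        h = first_block(x)
        first_residual = [a - b for a, b in zip(h, x)]
        can_skip = (
            not force_compute
            and self.prev_first_residual is not None
            and self.cached_rest_residual is not None
            and rel_l1(first_residual, self.prev_first_residual) < self.threshold
        )
        if can_skip:
            # Reuse the residual of the remaining blocks from the last computed step.
            self.skipped += 1
            return [a + r for a, r in zip(h, self.cached_rest_residual)]
        # Full compute: run the remaining blocks and refresh both cached residuals.
        out = rest_blocks(h)
        self.cached_rest_residual = [o - a for o, a in zip(out, h)]
        self.prev_first_residual = first_residual
        self.computed += 1
        return out


# Toy "transformer": a first block and a stack of remaining blocks.
first_block = lambda x: [v * 1.5 for v in x]
rest_blocks = lambda h: [v + 1.0 for v in h]

# One independent cache per CFG branch.
caches = {"cond": FirstBlockCache(0.1), "uncond": FirstBlockCache(0.1)}
inputs = [[1.0, 2.0], [1.01, 2.02], [2.0, 4.0]]
for x in inputs:
    for cache in caches.values():
        cache.forward(x, first_block, rest_blocks)
```

In this toy run the second input barely changes the first-block residual, so that step is skipped, while the third input changes it sharply and triggers a full compute. Passing `force_compute=True` bypasses the cache entirely, which is the kind of path a bit-exactness check against the uncached model would exercise.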
import nanoxdit