Deferred perf ticket — see docs/perf/010-async-compute.md.
Summary
Run post-FX passes (SSAO / SSR / SSGI / bloom) on a dedicated compute queue in parallel with the next frame's graphics work (shadow + main HDR). UE5 uses this pattern to hide ~20% of post-FX latency. Expected gain on Sponza: ~1.3 ms of the 16.7 ms vsync budget.
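The overlap arithmetic behind that estimate can be sketched as a steady-state frame-time model. This is an illustrative model only, not a measurement; the 10.0 / 1.3 ms split for Sponza is a hypothetical breakdown consistent with the ~1.3 ms figure above.

```rust
// Steady-state frame-cost model for async compute (numbers are assumptions):
// with a dedicated compute queue, frame N's post-FX overlaps frame N+1's
// graphics work, so the per-frame cost is the max of the two, not the sum.
fn serial_ms(graphics_ms: f64, post_fx_ms: f64) -> f64 {
    graphics_ms + post_fx_ms
}

fn pipelined_ms(graphics_ms: f64, post_fx_ms: f64) -> f64 {
    graphics_ms.max(post_fx_ms)
}

fn main() {
    let (graphics, post_fx) = (10.0, 1.3); // hypothetical Sponza split, in ms
    let saved = serial_ms(graphics, post_fx) - pipelined_ms(graphics, post_fx);
    println!(
        "serial: {:.1} ms, pipelined: {:.1} ms, saved: {:.1} ms",
        serial_ms(graphics, post_fx),
        pipelined_ms(graphics, post_fx),
        saved
    );
}
```

In this model the whole post-FX cost hides behind graphics, which is why the expected gain equals the post-FX duration rather than some fraction of it.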
Why deferred — upstream blocker
Audited wgpu 29 (current pin): Adapter::request_device returns exactly one (Device, Queue). There is no Device::get_compute_queue() / Instance::request_multiple_queues / queue-family API. The only concrete paths to actually implement this today:
- Drop to wgpu-hal directly for the second queue. wgpu-hal has per-backend Queue abstractions (Metal / DX12 / Vulkan each support multiple queues at the hal level), but mixing wgpu-core and wgpu-hal in the same renderer is fragile — lifetime and submission-ordering guarantees differ, and we'd lose the safe wgpu-core API for every resource the compute queue touches. Effectively rewrites the post-FX layer on a different abstraction. ~2-3 weeks.
- Native per-platform: metal-rs on macOS, windows-rs / DX12 on Windows, ash / Vulkan on Linux + Android. Three separate implementations, each with its own sync primitives. ~3+ weeks.
- Wait for wgpu upstream. Multi-queue support has been discussed but is not on a near-term roadmap as of wgpu 29.
The ticket's own sub-suggestion — "prototype serial-equivalent ordering first" (split encoders, same queue) — doesn't help: on a single queue, one big encoder + one submit generally outperforms multiple smaller submits because every submit introduces driver overhead without enabling parallelism.
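The cost structure of the split-encoder idea can be modeled with a toy calculation. This is a sketch under stated assumptions: the 0.15 ms per-submit overhead is an illustrative placeholder, not a profiled number, and `frame_cpu_cost_ms` is a hypothetical helper, not anything in our codebase.

```rust
// Why splitting encoders on one queue loses (model, not a measurement):
// each submit pays a fixed driver/validation overhead, and a single queue
// executes submissions in order anyway, so splitting adds cost without
// adding any parallelism. The GPU work is identical either way.
fn frame_cpu_cost_ms(gpu_work_ms: f64, n_submits: u32, submit_overhead_ms: f64) -> f64 {
    gpu_work_ms + f64::from(n_submits) * submit_overhead_ms
}

fn main() {
    let work = 14.0;     // same work whether it lives in one encoder or four
    let overhead = 0.15; // hypothetical per-submit cost
    let one = frame_cpu_cost_ms(work, 1, overhead);
    let four = frame_cpu_cost_ms(work, 4, overhead);
    assert!(one < four); // one big submit wins on a single queue
    println!("1 submit: {one:.2} ms, 4 submits: {four:.2} ms");
}
```

The inequality holds for any positive overhead, which is the whole argument: without a second queue there is no upside to offset the extra submits.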
Reopen criteria
- A target scene pushes past the 16.7 ms vsync ceiling and post-FX is the bottleneck.
- wgpu lands a stable multi-queue API upstream (track wgpu releases).
Why not worth a multi-week redesign today
Estimated gain is ~1.3 ms of a 16.7 ms budget on a scene that's already vsync-capped at 60 fps. The saved milliseconds would be invisible behind the vsync cap. Not worth the cross-platform correctness risk until we have a scene that actually pushes past the ceiling.
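The vsync argument reduces to rounding arithmetic. The 14.0 ms frame time below is an illustrative assumption (any value that already fits the budget behaves the same), and `presented_interval_ms` is a hypothetical helper for the model, not real swapchain code.

```rust
// Why a ~1.3 ms win is invisible under vsync (illustrative frame times):
// with vsync on, a frame presents at the next 16.7 ms boundary, so any
// frame that already fits the budget presents at the same instant whether
// or not it got faster.
fn presented_interval_ms(frame_ms: f64, vsync_ms: f64) -> f64 {
    (frame_ms / vsync_ms).ceil() * vsync_ms
}

fn main() {
    let vsync = 16.7;
    let before = presented_interval_ms(14.0, vsync); // hypothetical current frame
    let after = presented_interval_ms(14.0 - 1.3, vsync); // with the async-compute win
    assert_eq!(before, after); // both present at the same 16.7 ms boundary
    println!("before: {before} ms, after: {after} ms");
}
```

The win only becomes visible once a frame overruns the boundary, which is exactly the first reopen criterion above.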