
perf ticket 010: async compute for post-FX (blocked on wgpu multi-queue) #29

@proggeramlug

Description

Deferred perf ticket — see docs/perf/010-async-compute.md.

Summary

Run post-FX passes (SSAO / SSR / SSGI / bloom) on a dedicated compute queue in parallel with the next frame's graphics work (shadow + main HDR). UE5 uses this pattern to hide ~20% of post-FX latency. Expected gain on Sponza: ~1.3 ms of the 16.7 ms vsync budget.
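The intended schedule, sketched as hypothetical Rust. Note that `get_compute_queue()` does not exist in wgpu (that absence is exactly the blocker); this is pseudocode for the desired shape only:

```
// HYPOTHETICAL API - wgpu exposes no second queue; this sketches the
// desired schedule, it does not compile against any wgpu release.
//
// Frame N:   [shadow + main HDR] ------- graphics queue
// Frame N:                    [SSAO/SSR/SSGI/bloom] -- compute queue
// Frame N+1: [shadow + main HDR] overlaps frame N's post-FX.

let (device, gfx_queue) = adapter.request_device(&desc).await?;
let post_fx_queue = device.get_compute_queue()?; // hypothetical

// Graphics work for frame N+1 is submitted while frame N's post-FX runs.
gfx_queue.submit([shadow_and_hdr_encoder.finish()]);
post_fx_queue.submit([post_fx_encoder.finish()]); // concurrent on hardware
// A fence/semaphore would be needed where post-FX reads frame N's HDR target.
```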

Why deferred — upstream blocker

Audited wgpu 29 (our current pin): Adapter::request_device returns exactly one (Device, Queue) pair. There is no Device::get_compute_queue(), Instance::request_multiple_queues, or queue-family API. The only concrete paths to implementing this today:

  1. Drop to wgpu-hal directly for the second queue. wgpu-hal has per-backend Queue abstractions (Metal / DX12 / Vulkan each support multiple queues at the hal level), but mixing wgpu-core and wgpu-hal in the same renderer is fragile — lifetime and submission-ordering guarantees differ, and we'd lose the safe wgpu-core API for every resource the compute queue touches. Effectively rewrites the post-FX layer on a different abstraction. ~2-3 weeks.
  2. Native per-platform: metal-rs on macOS, windows-rs / DX12 on Windows, ash / Vulkan on Linux + Android. Three separate implementations, each with its own sync primitives. ~3+ weeks.
  3. Wait for wgpu upstream. Multi-queue support has been discussed but is not on a near-term roadmap as of wgpu 29.
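For concreteness, the single-queue shape of today's API (a sketch; the exact `request_device` signature has shifted across recent wgpu releases, so treat the call as illustrative):

```
// Sketch of the current wgpu surface: one Device, one Queue, no way to
// request a second queue or pick a queue family.
let (device, queue) = adapter
    .request_device(&wgpu::DeviceDescriptor::default())
    .await
    .expect("request_device");

// `queue` is the only handle; graphics and compute submissions all
// serialize through it.
queue.submit(std::iter::once(encoder.finish()));
```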

The ticket's own sub-suggestion — "prototype serial-equivalent ordering first" (split encoders, same queue) — doesn't help: on a single queue, one big encoder + one submit generally outperforms multiple smaller submits because every submit introduces driver overhead without enabling parallelism.
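The submit-overhead point can be made concrete with a toy model (all numbers illustrative, not measured): on a single queue, splitting into k submits adds k times the per-submit driver cost and overlaps nothing.

```rust
// Toy model: on a single queue, total frame time is the sum of the passes
// plus per-submit driver overhead. More submits only ever add cost.
// All numbers below are illustrative, not profiled.
fn single_queue_frame_ms(pass_ms: &[f32], submits: u32, submit_overhead_ms: f32) -> f32 {
    pass_ms.iter().sum::<f32>() + submits as f32 * submit_overhead_ms
}

fn main() {
    let passes = [0.4_f32, 0.3, 0.35, 0.25]; // SSAO, SSR, SSGI, bloom (made up)
    let one_big = single_queue_frame_ms(&passes, 1, 0.05);
    let split = single_queue_frame_ms(&passes, 4, 0.05);
    println!("one submit: {one_big:.2} ms, four submits: {split:.2} ms");
    assert!(split > one_big); // splitting never wins without a second queue
}
```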

Reopen criteria

  • A target scene pushes past the 16.7 ms vsync ceiling and post-FX is the bottleneck.
  • wgpu lands a stable multi-queue API upstream (track wgpu releases).

Why not worth a multi-week redesign today

Estimated gain is ~1.3 ms against a 16.7 ms budget, on a scene that already holds 60 fps under the vsync cap; the saved time is invisible until a frame actually misses vsync. Not worth the cross-platform correctness risk until we have a scene that pushes past the ceiling.
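The vsync arithmetic, spelled out with the ticket's own numbers (the 14.0 ms and 17.5 ms frame times are hypothetical examples):

```rust
// Vsync-cap arithmetic: presented frame time is quantized up to the next
// vsync interval, so saving ~1.3 ms only shows up once the raw frame time
// exceeds the 16.7 ms budget.
fn visible_speedup_ms(frame_ms: f32, saved_ms: f32, vsync_ms: f32) -> f32 {
    let presented = |t: f32| (t / vsync_ms).ceil() * vsync_ms;
    presented(frame_ms) - presented(frame_ms - saved_ms)
}

fn main() {
    let vsync = 16.7_f32;
    // Sponza today (hypothetical 14.0 ms): under budget, the 1.3 ms is invisible.
    assert_eq!(visible_speedup_ms(14.0, 1.3, vsync), 0.0);
    // A hypothetical 17.5 ms scene: dropping to 16.2 ms recovers a whole
    // vsync interval. This is the reopen condition.
    assert!(visible_speedup_ms(17.5, 1.3, vsync) > 0.0);
}
```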
