Deferred perf ticket — see docs/perf/008-visibility-buffer.md.
Summary
Replace Bloom's current 4-MRT G-buffer (18 bytes/pixel written per fragment) with a Nanite-style visibility buffer: store only (triangle_id, u, v, mesh_id) at ~8 bytes/pixel, defer full PBR shading to a second pass that fetches vertex data from shared storage buffers. Expected gain: ≥ 50% fragment-bandwidth reduction, plus "every visible pixel shades exactly once" when combined with depth prepass.
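To make the ~8 bytes/pixel figure concrete, here is a minimal sketch of packing one visibility sample into two 32-bit words. The bit split (20 bits triangle_id, 12 bits mesh_id, 16-bit unorm u and v) and the names `VisSample`, `pack`, and `unpack` are assumptions for illustration, not Bloom's actual layout. For comparison, a four-channel 32-bit target such as Rgba32Uint is 16 bytes per texel, so the ~8-byte figure presumably relies on some packing of this kind.

```rust
/// One visibility-buffer sample. Field widths are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VisSample {
    pub triangle_id: u32, // must fit in 20 bits (assumed budget)
    pub mesh_id: u32,     // must fit in 12 bits (assumed budget)
    pub u: u16,           // barycentric u as 16-bit unorm
    pub v: u16,           // barycentric v as 16-bit unorm
}

const TRI_BITS: u32 = 20;
const TRI_MASK: u32 = (1 << TRI_BITS) - 1;
const MESH_MASK: u32 = (1 << (32 - TRI_BITS)) - 1;

/// Packs a sample into 8 bytes. Returns None if an id exceeds its bit budget.
pub fn pack(s: VisSample) -> Option<[u32; 2]> {
    if s.triangle_id > TRI_MASK || s.mesh_id > MESH_MASK {
        return None;
    }
    let ids = (s.mesh_id << TRI_BITS) | s.triangle_id;
    let uv = ((s.v as u32) << 16) | (s.u as u32);
    Some([ids, uv])
}

/// Inverse of `pack`.
pub fn unpack(words: [u32; 2]) -> VisSample {
    VisSample {
        triangle_id: words[0] & TRI_MASK,
        mesh_id: words[0] >> TRI_BITS,
        u: (words[1] & 0xFFFF) as u16,
        v: (words[1] >> 16) as u16,
    }
}
```

With this split a scene is limited to 4096 meshes of about one million triangles each; a real layout would pick the budget from the largest mesh in the target scenes.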
Why deferred
The GPU bandwidth win is real (~14 MB/frame saved at 1600×900, scaled by the overdraw factor, on a benchmark that currently writes 26 MB/pass), but it is invisible behind the vsync cap on Sponza. The main perf target (60 fps at full visual quality) is already met; any further bandwidth reduction just gives headroom we can't measure on the current benchmark machine.
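The figures above follow directly from the per-pixel sizes in the summary. A small sketch of the arithmetic, assuming 1 MB = 10^6 bytes; the function names are illustrative and the overdraw factor is a free parameter, not a measured value:

```rust
/// Bytes written per pass at a given resolution and per-pixel cost.
pub fn bytes_per_pass(width: u64, height: u64, bytes_per_pixel: u64) -> u64 {
    width * height * bytes_per_pixel
}

/// Bytes saved per pass when moving from `old_bpp` to `new_bpp`,
/// scaled by an average overdraw factor (1.0 = every pixel written once).
pub fn bytes_saved(width: u64, height: u64, old_bpp: u64, new_bpp: u64, overdraw: f64) -> f64 {
    let old = bytes_per_pass(width, height, old_bpp) as f64;
    let new = bytes_per_pass(width, height, new_bpp) as f64;
    (old - new) * overdraw
}

/// Fractional reduction in bytes written per fragment.
pub fn reduction_ratio(old_bpp: u64, new_bpp: u64) -> f64 {
    (old_bpp - new_bpp) as f64 / old_bpp as f64
}
```

At 1600×900, `bytes_per_pass(1600, 900, 18)` is 25,920,000 bytes (the ~26 MB/pass), the 8-byte layout writes 11,520,000 bytes, and the difference is 14.4 MB at an overdraw factor of 1.0. `reduction_ratio(18, 8)` is about 0.56, which matches the ≥ 50% claim.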
Reopen criteria
- A target scene pushes past the 16.7 ms vsync ceiling on the benchmark machine. This ticket is the remaining GPU-side lever for bandwidth-bound scenes.
- Integrated / mobile GPUs become a priority. Bandwidth matters disproportionately more on tile-based and integrated hardware; this ticket is the single biggest available reduction.
- Overdraw-heavy scenes (foliage, hair, transparent-dense particles) become the target.
Prerequisites
- Ticket 009 (unified vertex + index buffers + per-mesh descriptor buffer) is a hard prerequisite — the shading pass needs a single bindless-style fetch across all meshes.
- Ticket 005 (depth prepass) becomes useful again at that point; land alongside.
Effort
~2+ weeks for the baseline redesign: main_hdr_pass output becomes Rgba32Uint (tri_id, u, v, mesh_id) only, new shading pass evaluates PBR from storage-buffer vertex fetches, downstream MRT consumers (SSR / SSGI / SSAO / post-FX) rewired to read from the rebuilt material channels.
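As a rough CPU-side illustration of what the new shading pass would do per pixel, the sketch below resolves (mesh_id, tri_id) to three vertices through a per-mesh descriptor and interpolates an attribute with the stored barycentrics. Everything here is hypothetical: the `MeshDescriptor` and `Vertex` layouts, the function names, and the convention that u weights the second vertex and v the third. The real pass would be a shader reading the unified buffers from Ticket 009.

```rust
/// Hypothetical per-mesh entry in the descriptor buffer.
#[derive(Debug, Clone, Copy)]
pub struct MeshDescriptor {
    pub first_index: u32, // offset of the mesh's first index in the shared index buffer
    pub base_vertex: u32, // offset added to each index to address the shared vertex buffer
}

/// Hypothetical vertex layout in the shared vertex buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Vertex {
    pub position: [f32; 3],
    pub normal: [f32; 3],
    pub uv: [f32; 2],
}

/// Fetches the three vertices of one triangle, or None if any lookup is out of range.
pub fn fetch_triangle(
    meshes: &[MeshDescriptor],
    indices: &[u32],
    vertices: &[Vertex],
    mesh_id: u32,
    triangle_id: u32,
) -> Option<[Vertex; 3]> {
    let mesh = meshes.get(mesh_id as usize)?;
    let first = mesh.first_index as usize + 3 * triangle_id as usize;
    let tri = indices.get(first..first + 3)?;
    let fetch = |i: u32| vertices.get(mesh.base_vertex as usize + i as usize).copied();
    Some([fetch(tri[0])?, fetch(tri[1])?, fetch(tri[2])?])
}

/// Barycentric interpolation of a 3-component attribute:
/// (1 - u - v) * a + u * b + v * c (assumed weight convention).
pub fn interpolate3(a: [f32; 3], b: [f32; 3], c: [f32; 3], u: f32, v: f32) -> [f32; 3] {
    let w = 1.0 - u - v;
    std::array::from_fn(|i| w * a[i] + u * b[i] + v * c[i])
}
```

The interpolated position, normal, and texture coordinates then feed the same PBR evaluation the G-buffer path uses today, which is why the unified buffers are a hard prerequisite: without them the pass cannot reach an arbitrary mesh from a single pixel.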
Quick-win intermediate (still deferred, ~2 days)
The ticket also documents a simpler intermediate step: drop unused MRTs when the dependent post-FX is disabled (velocity_rt only needed with TAA / motion blur; albedo_rt only needed with SSGI / SSR; material_rt only needed for SSR). That's a 30-50 % MRT bandwidth cut specifically for low-quality modes on integrated hardware — worth doing when targeting those adapters.
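The dependency rules in that intermediate step reduce to three boolean expressions. A minimal sketch, assuming hypothetical `PostFx` and `ActiveMrts` structs (the render-target names come from the ticket; the fourth MRT is not named there and is not modelled here):

```rust
/// Which post-FX are enabled for the current quality mode (hypothetical settings struct).
#[derive(Debug, Clone, Copy, Default)]
pub struct PostFx {
    pub taa: bool,
    pub motion_blur: bool,
    pub ssgi: bool,
    pub ssr: bool,
}

/// Which optional render targets the main pass still has to write.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ActiveMrts {
    pub velocity_rt: bool,
    pub albedo_rt: bool,
    pub material_rt: bool,
}

/// Maps enabled effects to the render targets they consume.
pub fn required_mrts(fx: PostFx) -> ActiveMrts {
    ActiveMrts {
        velocity_rt: fx.taa || fx.motion_blur, // velocity only feeds TAA / motion blur
        albedo_rt: fx.ssgi || fx.ssr,          // albedo only feeds SSGI / SSR
        material_rt: fx.ssr,                   // material only feeds SSR
    }
}
```

With every effect off, `required_mrts(PostFx::default())` reports all three targets as droppable, which is the low-quality case the 30-50 % estimate refers to.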