Conversation
I'm curious: what's making you go for manual tracking when you're implementing an API with automatic barriers on top of an API with automatic barriers? |
The sub allocation one is definitely a good point. I always thought that the Metal 3 fence API would be too much of a mess to manage. |
Merged
So that write data hazards can be detected.
…rier for fragment-to-fragment stage
Because depth or stencil might be read-only and bound as an SRV. Starting with write access would introduce false barriers on reads.
Note this doesn't apply to a texture that is only read from shaders (although it can still be a blit target), to reduce CPU overhead.
This PR implements:
The old implementation
#51 introduced "half" hazard tracking based on a Partitioned Bloom Filter. It is only used to tell whether two encoders' order can be swapped, to find more opportunities for encoder coalescing, while the Metal 3 driver still performs its own hazard resolution, so it essentially does double work. There is also no way to temporarily opt out of automatic hazard tracking, even when it can be proved that multiple accesses can safely overlap.
The new implementation
Therefore a new fence-based synchronization is implemented in this PR.
Intro
First, let's focus on synchronization between encoders and ignore intrapass barriers. Say we have encoders A and B, where B is encoded after A, and both access the same resource R.
Then it is straightforward to implement the following algorithm:

- For each resource, track the last write encoder `last_write` and a list of all previous read encoders `last_reads`.
- When the current encoder reads the resource: insert a fence between `last_write` and the current encoder, and add the current encoder to `last_reads`.
- When the current encoder writes the resource: insert a fence between `last_write` and the current encoder, and set `last_write` to the current encoder; also insert fences between every encoder in `last_reads` and the current encoder, and clear `last_reads`.

What if an encoder first writes and then reads a resource? That should be handled by an intrapass barrier, but we ignore it for the moment and assume an encoder either reads or writes a resource. There is a more practical problem to consider, though: what is the maximum size of the `last_reads` list if a resource is never written? Ideally the app would create an immutable resource, but it's also not uncommon to have a normal resource that is occasionally updated by a blit operation and frequently read from shaders. Thankfully, there is an upper bound in theory, because we don't need to insert a fence if an encoder is known to have completed.
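The per-resource bookkeeping above can be sketched as follows. This is an illustrative reduction, not the PR's actual code: encoders are plain integer ids in commit order, and a fence wait is represented as a (from, to) edge.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// One entry per resource: the last writing encoder plus all readers since.
struct ResourceState {
    int64_t last_write = -1;          // -1: never written
    std::vector<int64_t> last_reads;  // encoders that read since last_write
};

// A (from, to) pair meaning: `to` must wait on the fence updated by `from`.
using FenceEdge = std::pair<int64_t, int64_t>;

// Current encoder reads the resource: order it after the last writer (RAW).
inline void track_read(ResourceState& s, int64_t enc,
                       std::vector<FenceEdge>& edges) {
    if (s.last_write >= 0)
        edges.emplace_back(s.last_write, enc);
    s.last_reads.push_back(enc);
}

// Current encoder writes: order it after the last writer (WAW) and after
// every reader since then (WAR), then become the new last writer.
inline void track_write(ResourceState& s, int64_t enc,
                        std::vector<FenceEdge>& edges) {
    if (s.last_write >= 0)
        edges.emplace_back(s.last_write, enc);
    for (int64_t r : s.last_reads)
        edges.emplace_back(r, enc);
    s.last_reads.clear();
    s.last_write = enc;
}
```

With this shape, a write after two reads produces one WAW edge and two WAR edges, which is exactly where the unbounded growth of `last_reads` comes from when a resource is never written.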
Lanes & Parity

Let's think from a different angle: how many encoders can be concurrently executed on the GPU? In practice this is often a single-digit number, not because of hardware capabilities but because of the nature of the workload: most work introduces data dependencies, and other highly parallelizable work often has high shader occupancy (meaning the GPU can use most or all of its cores to execute the workload). So we can assume a reasonable maximum number of concurrent encoders K (in code: `kLane`, 64 in this implementation), and introduce the concepts of lane and parity.

A lane is a virtual timeline. There are K lanes, and each encoder is assigned a lane round-robin in commit order. Each encoder is also assigned a fence (`MTLFence`) that it will unconditionally update, so any following encoder that depends on its work can simply wait on that fence. All previous encoders of an encoder, up to (but not including) the nearest previous encoder in the same lane, are in the same parity (so a parity is a set containing K encoders). Similarly, we can define the previous parity. Every encoder unconditionally waits on the fences of all K encoders in the previous parity (including the nearest previous encoder in the same lane), and conditionally waits on encoders in the current parity if there is a hazard/dependency.

In this setup, even if no encoder has any data dependency on another, a maximum of K-way concurrency is still maintained. Now back to the question of how many encoders `last_reads` should store: the answer is just K, because at most K encoders can be concurrently executed, so `last_reads` only needs to track encoders in the current parity.

The total number of parities P (in code: `kParity`) also affects the behavior of the algorithm. We've mentioned the "previous parity", so there should be at least 2 parities, but a single parity actually works as well - it just degenerates the whole dependency graph into a linked list. So is a parity count of 2 good enough? It turns out P >= 3 is even better.

FenceSet and encoder reordering

So our algorithm only needs K*P fences (`MTLFence`) in total if we reuse fences. That is such a small number that we can simply use a bitset to maintain the `wait_fences`/`update_fences` of an encoder. Bitsets make unions and intersections extremely cheap to compute, and these results are particularly useful for checking whether two encoders have a data hazard and for merging two encoders when possible.

Now back to the reason P >= 3 is better: encoder reordering is only possible when P > 2. When P = 2, an encoder's next encoder updates a fence that was also updated in the previous parity and is therefore waited on by the encoder, so adjacent encoders can never be swapped.

The old Partitioned Bloom Filter implementation is now removed; we have deterministic dependency information.
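The lane/parity/fence assignment can be sketched like this. It is an illustrative model: `kLane` = 64 matches the text, but a `kParity` of 3 and all helper names here are assumptions made for the example.

```cpp
#include <bitset>
#include <cstdint>

constexpr uint32_t kLane = 64;   // max concurrent encoders, per the text
constexpr uint32_t kParity = 3;  // assumed value; >= 3 enables reordering

// Encoders are numbered in commit order; each is assigned a lane
// round-robin and reuses one of the kLane * kParity fences.
constexpr uint32_t lane_of(uint64_t encoder_seq) {
    return static_cast<uint32_t>(encoder_seq % kLane);
}
constexpr uint32_t parity_of(uint64_t encoder_seq) {
    return static_cast<uint32_t>((encoder_seq / kLane) % kParity);
}
constexpr uint32_t fence_of(uint64_t encoder_seq) {
    return parity_of(encoder_seq) * kLane + lane_of(encoder_seq);
}

// One bit per fence: unions/intersections are a handful of word ops.
using FenceSet = std::bitset<kLane * kParity>;

// Every encoder unconditionally waits on all fences of the previous parity.
inline FenceSet previous_parity_fences(uint64_t encoder_seq) {
    uint32_t prev = (parity_of(encoder_seq) + kParity - 1) % kParity;
    FenceSet s;
    for (uint32_t lane = 0; lane < kLane; ++lane)
        s.set(prev * kLane + lane);
    return s;
}

// Two encoders are ordered iff one waits on a fence the other updates;
// checking that is a single AND over the bitsets.
inline bool depends_on(const FenceSet& wait_fences,
                       const FenceSet& update_fences) {
    return (wait_fences & update_fences).any();
}
```

For example, encoder 64 is the first encoder of parity 1, and its unconditional wait set covers the fences of encoders 0..63 - the whole of parity 0.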
Fence simplification (FenceLocalityCheck)

If A waits on B and B waits on C, then A doesn't have to wait on C. `FenceLocalityCheck` implements this simplification by recording the last K*P wait fences and applying some magical bitwise operations. This is not necessary for correctness, but it is cheap, so let's do it anyway.
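The transitive rule can be illustrated with the same bitset representation. This is a toy sketch of the idea, not the actual `FenceLocalityCheck`: assume that for each fence we also remember which fences its updating encoder waited on, so those fences are already ordered before us transitively.

```cpp
#include <bitset>
#include <cstdint>

constexpr uint32_t kFences = 192;  // K*P in the text's terms (assumed value)
using FenceSet = std::bitset<kFences>;

// waited_by[f]: the fences that the encoder updating fence f itself waited
// on. If we wait on f, waiting on any of waited_by[f] again is redundant,
// because ordering through f already implies it.
inline FenceSet simplify_waits(const FenceSet& waits,
                               const FenceSet (&waited_by)[kFences]) {
    FenceSet redundant;
    for (uint32_t f = 0; f < kFences; ++f)
        if (waits.test(f))
            redundant |= waited_by[f];
    return waits & ~redundant;
}
```

So if A waits on fences {B, C} and B's encoder already waited on C, the simplified wait set is just {B}.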
Render encoders

Render encoders add extra complexity because the vertex (non-fragment) and fragment stages are different synchronization scopes. The solution is basically to introduce an additional virtual encoder.
Barriers
Now we finally consider intrapass barriers.
The idea is still simple and can essentially be demonstrated by a state machine:
These states are virtual, inferred from other states (see `GenericAccessTracker`). And of course render encoders add complexity: the state machine now has 16 states with 4 actions (read & write, for each of the pre-raster and fragment stages).
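For the simpler non-render case, such a state machine might look like the following sketch. It is a hypothetical three-state reduction, not the 16-state tracker described above: an intrapass barrier is needed whenever a write is involved in back-to-back accesses to the same resource.

```cpp
// Per-resource access state within a single encoder (illustrative names).
enum class Access { None, Read, Write };

// Record the next access; return true if an intrapass barrier is needed
// before it. Read-after-read never needs a barrier; read-after-write,
// write-after-read, and write-after-write all do.
inline bool access_and_check_barrier(Access& state, Access next) {
    bool barrier = (state == Access::Write && next != Access::None) ||
                   (state == Access::Read && next == Access::Write);
    state = next;
    return barrier;
}
```

A read-read-write-read sequence would then insert barriers only before the write and before the final read.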
Synchronization Granularity
Buffer: whole logical buffer (range/sub-allocation of physical buffer)
Texture: sub-resource, or whole resource if it's never bound as RTV/DSV/UAV
Performance
It is expected to not introduce any regression in performance.
Further work
These are now made possible:
`{Begin|End}UAVOverlap` in nvapi