
Resource Synchronization#73

Merged
3Shain merged 13 commits into main from refactor/fence-synchronization on Apr 13, 2026

Conversation

3Shain (Owner) commented Jun 25, 2025

This PR implements:

  • Fence-based synchronization: completely removes the automatic hazard-tracking behavior of Metal 3
  • Intrapass barriers

The old implementation

#51 introduced a "half" hazard-tracking scheme based on a Partitioned Bloom Filter. It was only used to answer "can the order of these two encoders be swapped?" in order to find more opportunities for encoder coalescing, while the Metal 3 driver still performed its own hazard resolution, so essentially the work was done twice. There was also no way to temporarily opt out of automatic hazard tracking, even when it could be proven that multiple accesses could safely overlap.

The new implementation

Therefore a new fence-based synchronization is implemented in this PR.

Intro

First of all, let's focus on synchronization between encoders and ignore intrapass barriers. Say we have encoders A and B, where B is encoded after A, and both access the same resource R.

  • If A reads R and B also reads R, no fence required between them.
  • If A reads R and B writes R, then we need a fence to resolve a write-after-read (WaR) hazard.
  • Similarly, if A writes and B reads there is a read-after-write (RaW) hazard, and if both write there is a write-after-write (WaW) hazard to resolve.

It is then straightforward to implement such an algorithm:

  • Assign each resource two pieces of state: the last writing encoder last_write, and a list of all encoders that have read it since that write, last_reads
  • When an encoder reads a resource:
    • (RaW) Insert a fence between last_write and the current encoder
    • Add the current encoder to last_reads
  • When an encoder writes a resource:
    • (WaW) Insert a fence between last_write and the current encoder, then set last_write to the current encoder
    • (WaR) Insert a fence between every encoder in last_reads and the current encoder, then clear last_reads
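The bookkeeping above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual code: encoders are plain integer ids and insert_fence() merely records producer/consumer pairs.

```python
fences = []  # recorded (producer_encoder, consumer_encoder) pairs

def insert_fence(src, dst):
    # No fence needed before the first access or within one encoder.
    if src is not None and src != dst:
        fences.append((src, dst))

class ResourceState:
    def __init__(self):
        self.last_write = None  # encoder that last wrote the resource
        self.last_reads = []    # encoders that read it since that write

    def on_read(self, encoder):
        # RaW: a reader must wait for the last writer.
        insert_fence(self.last_write, encoder)
        self.last_reads.append(encoder)

    def on_write(self, encoder):
        # WaW: a new writer waits for the previous writer...
        insert_fence(self.last_write, encoder)
        # WaR: ...and for every reader since that write.
        for reader in self.last_reads:
            insert_fence(reader, encoder)
        self.last_reads.clear()
        self.last_write = encoder
```

For example, write by encoder 0, reads by 1 and 2, then a write by 3 produces RaW fences (0,1) and (0,2), a WaW fence (0,3), and WaR fences (1,3) and (2,3).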

What if an encoder first writes and then reads a resource? That should be handled by an intrapass barrier, but let's ignore it for the moment and assume an encoder either reads or writes a resource, never both. There is a more practical problem to consider, though: what is the maximum size of the last_reads list if a resource is never written? Ideally the app would create immutable resources, but it's also not uncommon to have a normal resource that is occasionally updated by a blit operation and frequently read from shaders. Thankfully, there is an upper bound in theory, because we don't need to insert a fence on an encoder that is already known to be completed.

Lanes & Parity

Let's think from a different angle: how many encoders can be executing concurrently on the GPU? In practice this is often a single-digit number, not because of hardware limits but because of the nature of the workload: most work introduces data dependencies, and the remaining highly parallelizable work often has high shader occupancy (meaning the GPU can use most or all of its cores to execute it). We can therefore assume a reasonable maximum number of concurrent encoders K (in code: kLane, with a value of 64 in this implementation) and introduce the concepts of lane and parity.

A lane is a virtual timeline; there are K lanes, and each encoder is assigned a lane round-robin in commit order. Each encoder is also assigned a fence (MTLFence) that it unconditionally updates, so any following encoder that depends on its work can simply wait on that fence. All encoders before a given encoder, back to (but not including) the nearest previous encoder in the same lane, are in the same parity (so a parity is a set of K encoders). Similarly, we can define the previous parity. Every encoder unconditionally waits on the fences of all K encoders in the previous parity (which includes the nearest previous encoder in its own lane), and conditionally waits on encoders in the current parity when there is a hazard/dependency.

In this setup, even when encoders have no data dependencies on one another, full K-way concurrency is still possible despite the unconditional waits. Now back to the question of how many encoders last_reads should store: the answer is exactly K, because at most K encoders can execute concurrently, so last_reads only needs to track encoders in the current parity.

The total number of parities P (in code: kParity) also affects the behavior of the algorithm. We've mentioned the "previous parity", implying there should be at least 2 parities, but a single parity actually works as well; it just degenerates the whole dependency graph into a singly linked list. So is a parity count of 2 good enough? It turns out P >= 3 is even better.
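For concreteness, here is a hypothetical sketch of the lane/parity arithmetic described above. The function names (lane, fence_slot, previous_parity, current_parity) are illustrative and not from the PR; K and P follow the kLane/kParity values mentioned in the text.

```python
K = 64  # kLane: assumed maximum number of concurrent encoders
P = 3   # kParity: number of parities (P >= 3 enables reordering)

def lane(i):
    # Lanes are assigned round-robin in commit order.
    return i % K

def fence_slot(i):
    # Only K*P reusable fences exist in total; encoder i
    # unconditionally updates this slot, and encoder i + K*P
    # will reuse the same slot later.
    return lane(i) + K * ((i // K) % P)

def previous_parity(i):
    # The K encoders ending at the nearest previous encoder in the
    # same lane; their fences are waited on unconditionally.
    return list(range(max(0, i - 2 * K + 1), max(0, i - K + 1)))

def current_parity(i):
    # Encoders waited on only when a hazard is detected, which is
    # why last_reads never needs to hold more than K entries.
    return list(range(max(0, i - K + 1), i))
```

Note how previous_parity(i) ends exactly at i - K, the nearest previous encoder in encoder i's own lane, and how fence_slot cycles with period K*P.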

FenceSet and encoder reordering

So our algorithm needs only K*P fences (MTLFence) in total if we reuse them. That is such a small number that we can simply use a bitset to maintain an encoder's wait_fences/update_fences sets. Bitsets make unions and intersections extremely cheap to compute, and those results are exactly what is needed to check whether two encoders have a data hazard and to merge two encoders when possible.
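As an illustration of why bitsets pay off, here is a hypothetical sketch using Python integers as K*P-bit sets. The names has_hazard, can_swap, and merge are invented stand-ins for the checks the coalescing pass needs; the assumption in merge (a merged encoder should not wait on fences it updates itself) is mine, not stated in the PR.

```python
def bit(slot):
    # One bit per fence slot; an int models the whole FenceSet.
    return 1 << slot

def has_hazard(a_update, b_wait):
    # Encoder b depends on encoder a iff b waits on a fence a updates.
    return (a_update & b_wait) != 0

def can_swap(a_update, a_wait, b_update, b_wait):
    # Two adjacent encoders may be reordered iff neither waits on
    # a fence the other updates.
    return (not has_hazard(a_update, b_wait)
            and not has_hazard(b_update, a_wait))

def merge(a_wait, a_update, b_wait, b_update):
    # Coalescing two encoders: union both sets, but (assumption)
    # drop waits on fences the merged encoder updates itself.
    update = a_update | b_update
    wait = (a_wait | b_wait) & ~update
    return wait, update
```

Each of these is a handful of word-wide AND/OR operations, regardless of how many fences are involved.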

Now back to why P >= 3 is better: encoder reordering is only possible when P > 2. When P = 2, the encoder right after a given encoder updates a fence slot that was also updated in the previous parity, and that fence is therefore unconditionally waited on by the given encoder, so adjacent encoders can never be swapped.

The old Partitioned Bloom Filter implementation is now removed; we have deterministic dependency information instead.

Fence simplification (FenceLocalityCheck)

If A waits on B and B waits on C, then A does not have to wait on C. FenceLocalityCheck implements this simplification by recording the last K*P wait-fence sets and applying some clever bitwise operations. This step is not necessary for correctness, but it is cheap, so let's do it anyway.
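A hypothetical sketch of the idea, as one level of transitive reduction over bitsets (the real FenceLocalityCheck presumably differs in detail): for each of the K*P fence slots we remember the wait set of the encoder that last updated it, and any fence already covered transitively through another waited fence can be dropped.

```python
K, P = 64, 3
NUM_SLOTS = K * P
# Wait bitset of the encoder that last updated each fence slot.
slot_waits = [0] * NUM_SLOTS

def record_update(update_slot, wait_set):
    # Called when an encoder updating update_slot is committed.
    slot_waits[update_slot] = wait_set

def simplify_waits(wait_set):
    # If we wait on fence F, everything F's updater waited on is
    # already guaranteed complete, so drop it from our own set.
    covered = 0
    for slot in range(NUM_SLOTS):
        if (wait_set >> slot) & 1:
            covered |= slot_waits[slot]
    return wait_set & ~covered
```

For a chain C <- B <- A (A waits on B's fence, B waited on C's), simplify_waits removes C's fence from A's set.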

Render encoders

Render encoders add extra complexity because the vertex (non-fragment) and fragment stages are distinct synchronization scopes. The solution is basically to introduce an additional virtual encoder.

Barriers

Now we finally consider intrapass barriers.

The idea is still simple; it can essentially be demonstrated by a state machine:

| State | Event: Read (action & next state) | Event: Write (action & next state) |
| --- | --- | --- |
| Initial | Readonly | Written |
| Readonly | Readonly | Written + barrier |
| Written | ReadAfterWritten + barrier | Written + barrier |
| ReadAfterWritten | ReadAfterWritten | Written + barrier |

These states are virtual, inferred from other states (see GenericAccessTracker).
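The table translates directly into a transition map. A hypothetical sketch (access is an invented name; the PR's actual tracker is GenericAccessTracker):

```python
# (state, event) -> (needs_barrier, next_state), per the table above.
TRANSITIONS = {
    ("Initial",          "read"):  (False, "Readonly"),
    ("Initial",          "write"): (False, "Written"),
    ("Readonly",         "read"):  (False, "Readonly"),
    ("Readonly",         "write"): (True,  "Written"),
    ("Written",          "read"):  (True,  "ReadAfterWritten"),
    ("Written",          "write"): (True,  "Written"),
    ("ReadAfterWritten", "read"):  (False, "ReadAfterWritten"),
    ("ReadAfterWritten", "write"): (True,  "Written"),
}

def access(state, event):
    # Returns whether an intrapass barrier must be emitted before
    # this access, and the resource's next tracking state.
    return TRANSITIONS[(state, event)]
```

Note that only accesses that follow a write in a conflicting way (write-after-read, read-after-write, write-after-write) emit a barrier; repeated reads stay barrier-free.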

And of course the render encoder adds complexity: its state machine has 16 states and 4 events (read & write, for each of the pre-raster and fragment stages).

Synchronization Granularity

  • Buffer: the whole logical buffer (a range/sub-allocation of a physical buffer)
  • Texture: per sub-resource, or the whole resource if it is never bound as RTV/DSV/UAV

Performance

This change is expected to introduce no performance regression.

Further works

The following work is now made possible:

  • Residency Set: to further reduce CPU overhead
  • {Begin|End}UAVOverlap in nvapi
  • Migration to Metal 4

K0bin commented Jun 25, 2025

I'm curious: what's making you go for manual tracking when you're implementing an API with automatic barriers on top of an API with automatic barriers?

3Shain (Owner, Author) commented Jun 25, 2025

> I'm curious: what's making you go for manual tracking when you're implementing an API with automatic barriers on top of an API with automatic barriers?

  1. It is a prerequisite for resource suballocation. Currently every buffer allocation takes a minimum of 4 KiB (one page) of memory, which is not ideal, especially on WoW64, so suballocating from a large buffer is desirable. With automatic tracking, however, false dependencies can be created when two encoders read from/write to non-overlapping ranges of the same buffer.
  2. It makes a Begin/EndUAVOverlap implementation possible.
  3. It's faster, so why not?

K0bin commented Jun 25, 2025

The suballocation one is definitely a good point. I always thought that the Metal 3 fence API would be too much of a mess to manage.

@3Shain 3Shain marked this pull request as ready for review July 1, 2025 10:50
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch 2 times, most recently from c21758b to 22d350f on July 4, 2025 14:26
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch from 22d350f to 56ea8bd on September 17, 2025 01:38
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch 2 times, most recently from 6671b9b to a9300cf on December 17, 2025 09:07
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch 2 times, most recently from 93ebb08 to 2a82557 on February 11, 2026 06:59
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch from 2a82557 to e09ca6c on March 20, 2026 14:51
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch 3 times, most recently from fd50b91 to 3622738 on April 1, 2026 12:32
@3Shain 3Shain mentioned this pull request Apr 3, 2026
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch from 3622738 to b2c13d6 on April 6, 2026 23:05
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch 3 times, most recently from 1e03c27 to 9ac2073 on April 8, 2026 20:41
@3Shain 3Shain changed the title WIP: Fence Resource Synchronization Apr 8, 2026
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch 3 times, most recently from 15045d9 to 2c90411 Compare April 13, 2026 08:10
@3Shain 3Shain force-pushed the refactor/fence-synchronization branch from 2c90411 to 92cb1bb Compare April 13, 2026 08:24
@3Shain 3Shain merged commit 5e886a1 into main Apr 13, 2026
15 checks passed
@3Shain 3Shain mentioned this pull request Apr 13, 2026
