Conversation
I'm curious: what's making you go for manual tracking when you're implementing an API with automatic barriers on top of an API with automatic barriers? |
The sub allocation one is definitely a good point. I always thought that the Metal 3 fence API would be too much of a mess to manage. |
Merged
So that write data hazards can be detected.
…rier for fragment-to-fragment stage
Because depth or stencil might be read-only and bound as an SRV. Starting with write access would introduce false barriers on reads.
Note this doesn't apply to a texture that is only read from shaders (although it can still be a blit target), to reduce CPU overhead.
This PR implements:
The old implementation
#51 introduced "half" hazard tracking based on a Partitioned Bloom Filter. It is only used to tell whether two encoders' order can be swapped, to find more opportunities for encoder coalescing, while the Metal 3 driver still performs its own hazard resolution, so it essentially does double work. There is also no way to temporarily opt out of automatic hazard tracking, even when it can be proved that multiple accesses can safely overlap.
The new implementation
Therefore a new fence-based synchronization is implemented in this PR.
Intro
First, let's focus on synchronization between encoders and ignore intrapass barriers. Say we have encoders A and B, where B is encoded after A, and both access the same resource R.
Then it is straightforward to implement the following algorithm:

- For each resource, track the last write encoder `last_write` and a list of all previous read encoders `last_reads`.
- When the current encoder reads the resource: insert a fence between `last_write` and the current encoder, and add the current encoder to `last_reads`.
- When the current encoder writes the resource: insert a fence between `last_write` and the current encoder, and set `last_write` to the current encoder; also insert fences between every encoder in `last_reads` and the current encoder, and clear `last_reads`.

What if an encoder first writes and then reads a resource? That should be handled by an intrapass barrier, but we ignore it for the moment and assume an encoder either reads or writes a resource. There is a more practical problem to consider, though: what is the maximum size of the `last_reads` list if a resource is never written? Ideally the app would create an immutable resource, but it's also not uncommon to have a normal resource that is occasionally updated by a blit operation and frequently read from shaders. Thankfully, there is an upper bound in theory, because we don't need to insert a fence if an encoder is known to have completed.
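The per-resource bookkeeping above can be sketched as follows. This is an illustrative reduction, not the PR's actual code: encoders are plain integer ids in commit order, and a fence wait is represented as a (from, to) edge.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// One entry per resource: the last writing encoder plus all readers since.
struct ResourceState {
    int64_t last_write = -1;          // -1: never written
    std::vector<int64_t> last_reads;  // encoders that read since last_write
};

// A (from, to) pair meaning: `to` must wait on the fence updated by `from`.
using FenceEdge = std::pair<int64_t, int64_t>;

// Current encoder reads the resource: order it after the last writer (RAW).
inline void track_read(ResourceState& s, int64_t enc,
                       std::vector<FenceEdge>& edges) {
    if (s.last_write >= 0)
        edges.emplace_back(s.last_write, enc);
    s.last_reads.push_back(enc);
}

// Current encoder writes: order it after the last writer (WAW) and after
// every reader since then (WAR), then become the new last writer.
inline void track_write(ResourceState& s, int64_t enc,
                        std::vector<FenceEdge>& edges) {
    if (s.last_write >= 0)
        edges.emplace_back(s.last_write, enc);
    for (int64_t r : s.last_reads)
        edges.emplace_back(r, enc);
    s.last_reads.clear();
    s.last_write = enc;
}
```

With this shape, a write after two reads produces one WAW edge and two WAR edges, which is exactly where the unbounded growth of `last_reads` comes from when a resource is never written.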
Lanes & Parity

Let's think from a different angle: how many encoders can be concurrently executed on the GPU? In practice this is often a single-digit number, not because of hardware capabilities but because of the nature of the workload: most work introduces data dependencies, and other highly parallelizable work often has high shader occupancy (meaning the GPU can use most or all of its cores to execute the workload). So we can assume a reasonable maximum number of concurrent encoders K (in code: `kLane`, 64 in this implementation), and introduce the concepts of lane and parity.

A lane is a virtual timeline. There are K lanes, and each encoder is assigned a lane round-robin in commit order. Each encoder is also assigned a fence (`MTLFence`) that it will unconditionally update, so any following encoder that depends on its work can simply wait on that fence. All previous encoders of an encoder, up to (but not including) the nearest previous encoder in the same lane, are in the same parity (so a parity is a set containing K encoders). Similarly, we can define the previous parity. Every encoder unconditionally waits on the fences of all K encoders in the previous parity (including the nearest previous encoder in the same lane), and conditionally waits on encoders in the current parity if there is a hazard/dependency.

In this setup, even if no encoder has any data dependency on another, a maximum of K-way concurrency is still maintained. Now back to the question of how many encoders `last_reads` should store: the answer is just K, because at most K encoders can be concurrently executed, so `last_reads` only needs to track encoders in the current parity.

The total number of parities P (in code: `kParity`) also affects the behavior of the algorithm. We've mentioned the "previous parity", so there should be at least 2 parities, but a single parity actually works as well - it just degenerates the whole dependency graph into a linked list. So is a parity count of 2 good enough? It turns out P >= 3 is even better.

FenceSet and encoder reordering

So our algorithm only needs K*P fences (`MTLFence`) in total if we reuse fences. That is such a small number that we can simply use a bitset to maintain the `wait_fences`/`update_fences` of an encoder. Bitsets make unions and intersections extremely cheap to compute, and these results are particularly useful for checking whether two encoders have a data hazard and for merging two encoders when possible.

Now back to the reason P >= 3 is better: encoder reordering is only possible when P > 2. When P = 2, an encoder's next encoder updates a fence that was also updated in the previous parity and is therefore waited on by the encoder, so adjacent encoders can never be swapped.

The old Partitioned Bloom Filter implementation is now removed; we have deterministic dependency information.
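The lane/parity/fence assignment can be sketched like this. It is an illustrative model: `kLane` = 64 matches the text, but a `kParity` of 3 and all helper names here are assumptions made for the example.

```cpp
#include <bitset>
#include <cstdint>

constexpr uint32_t kLane = 64;   // max concurrent encoders, per the text
constexpr uint32_t kParity = 3;  // assumed value; >= 3 enables reordering

// Encoders are numbered in commit order; each is assigned a lane
// round-robin and reuses one of the kLane * kParity fences.
constexpr uint32_t lane_of(uint64_t encoder_seq) {
    return static_cast<uint32_t>(encoder_seq % kLane);
}
constexpr uint32_t parity_of(uint64_t encoder_seq) {
    return static_cast<uint32_t>((encoder_seq / kLane) % kParity);
}
constexpr uint32_t fence_of(uint64_t encoder_seq) {
    return parity_of(encoder_seq) * kLane + lane_of(encoder_seq);
}

// One bit per fence: unions/intersections are a handful of word ops.
using FenceSet = std::bitset<kLane * kParity>;

// Every encoder unconditionally waits on all fences of the previous parity.
inline FenceSet previous_parity_fences(uint64_t encoder_seq) {
    uint32_t prev = (parity_of(encoder_seq) + kParity - 1) % kParity;
    FenceSet s;
    for (uint32_t lane = 0; lane < kLane; ++lane)
        s.set(prev * kLane + lane);
    return s;
}

// Two encoders are ordered iff one waits on a fence the other updates;
// checking that is a single AND over the bitsets.
inline bool depends_on(const FenceSet& wait_fences,
                       const FenceSet& update_fences) {
    return (wait_fences & update_fences).any();
}
```

For example, encoder 64 is the first encoder of parity 1, and its unconditional wait set covers the fences of encoders 0..63 - the whole of parity 0.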
Fence simplification (FenceLocalityCheck)

If A waits on B and B waits on C, then A doesn't have to wait on C. `FenceLocalityCheck` implements this simplification by recording the last K*P wait fences and applying some magical bitwise operations. This is not necessary for correctness, but it is cheap, so let's do it anyway.
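The transitive rule can be illustrated with the same bitset representation. This is a toy sketch of the idea, not the actual `FenceLocalityCheck`: assume that for each fence we also remember which fences its updating encoder waited on, so those fences are already ordered before us transitively.

```cpp
#include <bitset>
#include <cstdint>

constexpr uint32_t kFences = 192;  // K*P in the text's terms (assumed value)
using FenceSet = std::bitset<kFences>;

// waited_by[f]: the fences that the encoder updating fence f itself waited
// on. If we wait on f, waiting on any of waited_by[f] again is redundant,
// because ordering through f already implies it.
inline FenceSet simplify_waits(const FenceSet& waits,
                               const FenceSet (&waited_by)[kFences]) {
    FenceSet redundant;
    for (uint32_t f = 0; f < kFences; ++f)
        if (waits.test(f))
            redundant |= waited_by[f];
    return waits & ~redundant;
}
```

So if A waits on fences {B, C} and B's encoder already waited on C, the simplified wait set is just {B}.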
Render encoders

Render encoders add extra complexity because the vertex (non-fragment) and fragment stages are different synchronization scopes. The solution is basically to introduce an additional virtual encoder.
Barriers
Now we finally consider intrapass barriers.
The idea is still simple and can essentially be demonstrated by a state machine:
These states are virtual, inferred from other states (see `GenericAccessTracker`). And of course render encoders add complexity: the state machine now has 16 states with 4 actions (read & write, for each of the pre-raster and fragment stages).
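For the simpler non-render case, such a state machine might look like the following sketch. It is a hypothetical three-state reduction, not the 16-state tracker described above: an intrapass barrier is needed whenever a write is involved in back-to-back accesses to the same resource.

```cpp
// Per-resource access state within a single encoder (illustrative names).
enum class Access { None, Read, Write };

// Record the next access; return true if an intrapass barrier is needed
// before it. Read-after-read never needs a barrier; read-after-write,
// write-after-read, and write-after-write all do.
inline bool access_and_check_barrier(Access& state, Access next) {
    bool barrier = (state == Access::Write && next != Access::None) ||
                   (state == Access::Read && next == Access::Write);
    state = next;
    return barrier;
}
```

A read-read-write-read sequence would then insert barriers only before the write and before the final read.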
Synchronization Granularity
Buffer: whole logical buffer (range/sub-allocation of physical buffer)
Texture: sub-resource, or whole resource if it's never bound as RTV/DSV/UAV
Performance
It is expected to not introduce any regression in performance.
Further work
These are now made possible:
`{Begin|End}UAVOverlap` in nvapi