refactor: resource management, dependency tracking & encoder reordering #51
Merged
MetalFX spatial upscaling is broken in some cases, particularly if the swapchain is created with an sRGB format no ci
This commit temporarily breaks deferred context
42f3ae1 to 6a91256
…apchain description
Replaced a bunch of com_cast with static_cast or reinterpret_cast
This PR is a complete refactoring of resource management and command translation with one motivation: letting DXMT issue Metal commands that are more efficient on Apple GPUs, which use a TBDR (tile-based deferred rendering) architecture.
Apple provides some optimization guidelines, and one major aspect is encoder coalescing. Typically, when a render encoder that stores some attachments is followed by another render encoder that loads the same set of attachments, the two can be merged into one encoder, saving the memory bandwidth of one store and one load. This PR makes DXMT try to identify encoders that can be coalesced, even in non-trivial cases (e.g. two coalesce-able encoders separated by another encoder that has no data dependency on either of them). That means DXMT will sometimes change the order of encoders, when doing so is beneficial and the final result is the same.
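As a rough illustration of the coalescing condition described above, here is a minimal sketch. It is not DXMT's actual code: `Attachment`, `RenderEncoder`, `canCoalesce` and `coalesce` are hypothetical names, and the load/store enums are simplified stand-ins for Metal's load and store actions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified, hypothetical stand-ins for Metal load/store actions.
enum class LoadAction { Load, Clear, DontCare };
enum class StoreAction { Store, DontCare };

struct Attachment {
  std::uint64_t texture_id; // hypothetical identifier of the bound texture
  LoadAction load;
  StoreAction store;
};

struct RenderEncoder {
  std::vector<Attachment> attachments;
};

// Two adjacent encoders are candidates for coalescing when the first one
// stores every attachment and the second one loads the very same set.
bool canCoalesce(const RenderEncoder &first, const RenderEncoder &second) {
  if (first.attachments.size() != second.attachments.size())
    return false;
  for (std::size_t i = 0; i < first.attachments.size(); ++i) {
    const Attachment &a = first.attachments[i];
    const Attachment &b = second.attachments[i];
    if (a.texture_id != b.texture_id)
      return false;
    if (a.store != StoreAction::Store || b.load != LoadAction::Load)
      return false;
  }
  return true;
}

// The merged encoder keeps the first load action and the last store action,
// so the intermediate store + load pair disappears.
// Precondition: canCoalesce(first, second) returned true.
RenderEncoder coalesce(const RenderEncoder &first, const RenderEncoder &second) {
  RenderEncoder merged = first;
  for (std::size_t i = 0; i < merged.attachments.size(); ++i)
    merged.attachments[i].store = second.attachments[i].store;
  return merged;
}
```

In this sketch the draw commands of both encoders would simply be appended in order inside the merged encoder; that part is omitted.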
Before this PR, all D3D11 commands were first written to DXMT's internal command ring buffer, and when Present(), Flush() or any synchronization happened, the ring buffer was executed by a dedicated thread that encodes the actual Metal commands. This PR introduces a secondary internal command buffer. Commands written to the original, primary command buffer still have to be executed in their original order, and they populate the secondary command buffer, which can be executed out of order. In the meantime, all dependency and residency information is collected and eventually fed into an algorithm that reorders encoders and identifies coalesce-able ones. (The current optimization algorithm is already very powerful, but it is not in its final form and there is still room for improvement, so we omit the details here.)
But how do we know whether two encoders have a data dependency, so that we can decide whether changing their order is possible? We need to know whether their lists of read and written resources have a common element. However, enumerating and comparing resource lists is a no-go, since the time complexity is $O(n^2)$. Thankfully we have a very powerful tool: the Partitioned Bloom Filter. You may have heard of the Bloom Filter, which can tell you immediately whether an element is in a set, with a small probability of a false positive. The Partitioned Bloom Filter is an enhanced version that can be used to test whether two sets have any intersection in constant time. Although false positives are still possible, their probability can be controlled. And remember that what we are doing is optimization: even if we get a false positive, we lose nothing but a possible optimization, and correctness is unaffected. Even better, taking the union of two Partitioned Bloom Filters is also a constant-time operation.
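To make the intersection test concrete, here is a minimal sketch of a partitioned Bloom filter. It is an illustration under assumptions, not DXMT's actual implementation: the names `PartitionedBloomFilter`, `EncoderAccess` and `mayDepend`, the choice of 4 partitions of 64 bits each, and the hash mixer are all hypothetical. Each inserted id sets exactly one bit in every partition, so if two sets share an element, the bitwise AND of every pair of partitions must be non-zero. If any partition's AND is zero, the sets are definitely disjoint.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: K partitions of 64 bits each (sizes are assumptions).
template <std::size_t K = 4>
class PartitionedBloomFilter {
public:
  // Insert a resource id by setting exactly one bit in every partition.
  void insert(std::uint64_t id) {
    for (std::size_t i = 0; i < K; ++i)
      parts_[i] |= std::uint64_t{1} << (mix(id, i) & 63u);
  }

  // false: the two sets are definitely disjoint.
  // true:  they may intersect (false positives are possible).
  bool mayIntersect(const PartitionedBloomFilter &other) const {
    for (std::size_t i = 0; i < K; ++i)
      if ((parts_[i] & other.parts_[i]) == 0)
        return false;
    return true;
  }

  // Union of two sets: a bitwise OR per partition, independent of set size.
  void merge(const PartitionedBloomFilter &other) {
    for (std::size_t i = 0; i < K; ++i)
      parts_[i] |= other.parts_[i];
  }

private:
  // splitmix64-style mixer, seeded per partition (an arbitrary choice here).
  static std::uint64_t mix(std::uint64_t x, std::uint64_t seed) {
    x += 0x9E3779B97F4A7C15ull * (seed + 1);
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
  }

  std::array<std::uint64_t, K> parts_{};
};

// Read and written resource sets of one encoder (hypothetical structure).
struct EncoderAccess {
  PartitionedBloomFilter<> reads;
  PartitionedBloomFilter<> writes;
};

// Conservative hazard test: read-after-write, write-after-read and
// write-after-write. Returning true only blocks a reordering.
inline bool mayDepend(const EncoderAccess &a, const EncoderAccess &b) {
  return a.writes.mayIntersect(b.reads) ||
         a.reads.mayIntersect(b.writes) ||
         a.writes.mayIntersect(b.writes);
}
```

With a fixed number of partitions, both `mayIntersect` and `merge` cost a handful of word operations regardless of how many resources each encoder touches. A false positive from `mayDepend` merely keeps two encoders in their original order.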
Encoder reordering is not the only benefit of the secondary command buffer. It also simplifies the implementation of Deferred Context and in fact eliminates one of its drawbacks: a game that uses Deferred Context tends to render different parts of the same scene on different threads, which creates a lot of coalesce-able encoders, but that is no longer a problem. Another optimization unlocked by the secondary command buffer is resource renaming: when a resource is fully cleared or discarded, a fresh resource with the same descriptor is created, or allocated from a pool, instead. This simplifies the dependencies between encoders, which is not only good for GPU parallelism but also helps our reordering algorithm detect more potential optimizations.
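Resource renaming can be pictured roughly as follows. This is a hedged sketch only: `TextureDesc`, `RenamePool`, `Texture` and `renameOnDiscard` are hypothetical names, allocations are represented by plain integer ids, and the point at which an old allocation is safe to return to the pool (once the GPU no longer uses it) is left to the caller.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical descriptor; real descriptors carry many more fields.
struct TextureDesc {
  std::uint32_t width, height, format;
  bool operator<(const TextureDesc &o) const {
    return std::tie(width, height, format) <
           std::tie(o.width, o.height, o.format);
  }
};

using AllocationId = std::uint64_t; // stand-in for a real GPU allocation

// Pool of free allocations, keyed by descriptor.
class RenamePool {
public:
  // Reuse a free allocation with the same descriptor, or create a new one.
  AllocationId acquire(const TextureDesc &desc) {
    std::vector<AllocationId> &free_list = free_[desc];
    if (!free_list.empty()) {
      AllocationId id = free_list.back();
      free_list.pop_back();
      return id;
    }
    return next_id_++;
  }

  // Return an allocation once nothing in flight references it anymore.
  void release(const TextureDesc &desc, AllocationId id) {
    free_[desc].push_back(id);
  }

private:
  std::map<TextureDesc, std::vector<AllocationId>> free_;
  AllocationId next_id_ = 1;
};

// The API-visible resource; its backing allocation can change over time.
struct Texture {
  TextureDesc desc;
  AllocationId current;
};

// On a full clear or discard, point the texture at a fresh allocation.
// Later encoders then no longer depend on earlier users of the old one.
// Returns the old allocation so the caller can release it later.
AllocationId renameOnDiscard(Texture &tex, RenamePool &pool) {
  AllocationId old = tex.current;
  tex.current = pool.acquire(tex.desc);
  return old;
}
```

Because the renamed texture has a new allocation id, a dependency tracker keyed on allocation ids sees no hazard between encoders that used the old contents and encoders that write the new ones.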
Overall, this PR provides a performance boost for devices with limited memory bandwidth (typically non-Max/Ultra chips) and a solid foundation for further optimization.