Expose primitive ordered pixel shaders #108
Can I get a response from the team? |
Our stance is that we don’t want to implement it. It messes with subpasses and is not the right formulation |
What exactly do you mean by "not the right formulation"? Is this extension somehow the wrong abstraction to map to this feature inside the hardware? If so, is there a better way to expose it, like a potential framebuffer fetch extension? (I don't think AMD HW supports multiple render targets so things could get sketchy) How is the interlocks extension different compared to your "primitive ordered pixel shaders"? Because we badly need this extension or something similar to a framebuffer fetch. |
How would you like to proceed with this? Can we at least get a vendor extension from AMD exposing this directly if your team doesn't like how interlocks are specified, or would you prefer to close this issue if you have no intention of exposing similar functionality in Vulkan? I am requesting this because there's arguably a stronger case for having ROV-like functionality exposed in Vulkan rather than in D3D12, since there's higher interest from open source projects in using it than from AAA game engine developers. |
@Degerz @jinjianrong it's sad to see AMD has no plans to support this extension on the AMD Vulkan driver even now that VK_EXT_fragment_shader_interlock is a de facto "standard" by the fact that it is supported by all other vendors (NV & Intel) and all other OSes (Windows & Linux): heck, even since Metal 2.0 on macOS we have support for the exact same feature.. on Vega cards in the iMac Pro we get "RasterOrderGroupsSupported".. adding more use cases: well, in fact, the Xenia emulator's D3D12 backend uses the ROV feature for better emulation of the Xbox 360 EDRAM hardware.. also Xenia is working towards adding Linux support: EDIT: |
@oscarbg Good idea to get more people interested in this functionality, I think I'll do the same as well! While you're at it, can you go request other AMD engineers like @Anteru on twitter to show that the community wants this functionality as well on AMD HW on Vulkan. cc @tadanokojin @hrydgard @pent0 The above have actively expressed interest and/or already using functionality similar to shader interlock in their projects. One of their main motivations to using Vulkan is getting access to modern GPU features like interlock so we'd prefer it from AMD if we didn't have to move over to platform specific APIs like Metal or D3D12 to be able to use this feature! Vulkan subpasses are possibly not powerful enough for their purposes. I don't care if AMD doesn't ever expose |
I just want to do programmable blending. If you guys can provide other primitives it would also be OK, but this is best. Texture barrier (for OpenGL) is what I am using, but it's not really the fastest path (same for Vulkan if applicable). I don't really know how you guys would do it though |
Here is my suggestion. Use the extension and tell users to switch to either Intel or Nvidia graphics hardware because AMD refuses to support the extension and cite this issue. Watch AMD backpedal on this very quickly after an executive hears about the situation. Also, ask the RADV developers to implement support so that Windows users who want to use software that depends on it have the option of switching to Linux for it. |
From the passes point of view, what's different in this from regular image/buffer stores? |
@ryao Seems like an unlikely scenario that it would reach to the very high echelons in the company and I don't know if mesa developers are all that interested since I haven't seen any patches related to this issue ... @Triang3l Here are some insights from Sascha. Along with the stated limitations, I do not think that Vulkan subpasses are capable of handling self-intersecting draws just like OpenGL's texture barrier. |
News article on Phoronix could help with this a bit. |
All that you need is for end users to start telling each other that AMD graphics hardware is not friendly to emulators after they start asking why it doesn’t work on AMD graphics hardware. It will reach the upper echelons when they are trying to figure out why they did not meet their sales projections. As for the mesa developers, they might not know that this extension has any use cases. I was under the impression that those working on RADV were volunteers, so if you don’t ask them about it, they seem less likely to implement it. |
@ryao TBH, I feel it is more constructive for developers like @pent0 to just express their desire to expose this feature and just list out their use cases instead ... At the end of the day, advanced system emulation doesn't even account for the fraction of AMD's customers and the emulation community is already aware that AMD has a checkered history with them so the leading hardware vendor is already favoured over there. I'd prefer it if we can show that their driver manager's position is out of touch with the community's position because unlike with higher ups such as executives there's no guarantee that they'd understand this issue or that they'd be specialists regarding GPU drivers to help us out. |
I love you guys. I know you can do it, however hard it is. Go go go! We all want this feature. Also, programmable blending is not only what emulators want; it's also what many game developers desire, to achieve nice and godly effects in their games which fixed blending can not do. I can't bring an example for PC, but here is an example of one doing it on Metal on iOS. "Programmable shader pipeline has been here for 15 years, so programmable blending should be too. It's the de facto nowadays." (I copied this quote from this article). I am really bad at wordings hehe, I just express what many people want. Reconsider please :) |
Who exactly does this affect anyway? Just PowerVR GPU users? If so, I can understand why AMD doesn't want to dedicate valuable development time to this endeavour. Nothing is stopping the community adding it themselves thanks to OPEN-SOURCE drivers! |
In relation to the topic and example given! How is this not obvious?
Then that's not the best example; it would have been better to give examples that are not fringe cases but more common use. Take a chill pill mate! |
It's not only relevant to emulation; it's related to everything. DXVK needs interlock to implement ROVs, game developers need it to do programmable blending. It affects many things, hence this extension exists. Please think more. |
@pent0 I get that you are angry. If you feel misunderstood, express that feeling. Make your points and cut out the strong language. Strong language won't help your cause (for innocent bystanders it looks like you're grasping at straws), and it is disrespectful. Your point is much, much stronger without the strong language. |
Really sorry for the bother! My point still stands anyway: it helps many things, not just emulation for PowerVR. It's expressed above.
|
@Degerz Here is the feedback from our Vulkan team regarding the extension: Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications. Additionally, this is an inefficient method of performing the typical thing it's often advocated for - order independent transparency. For such an effect we would usually recommend using one of the many two-pass OIT algorithms out there, and making use of multiple subpasses, with the second pass doing the resolve. This is likely the most portably efficient mechanism you can use that works between desktop and mobile parts. We're thus not inclined to support it, as we'd rather not promote an inefficient technology. However, if you're looking to do direct emulation, we are not sure that really helps you - perhaps you could elaborate on what it is you're trying to emulate exactly and we may be able to advise on an alternative method? |
It's being used in Just Cause 3 and GRID 2. https://software.intel.com/en-us/articles/oit-approximation-with-pixel-synchronization |
I don't know much about this stuff, so I will let the guys who know discuss. I will try to get a workaround for now. Hi, in our case, we are trying to emulate a feature from the PowerVR GPU. It's that you can fetch the last fragment data of a texel in the color buffer and use it for blending inside the fragment shader. It's like blending, but not fixed: it happens inside the shader (programmable). For OpenGL on AMD, we are using texture barrier. On Vulkan I'm not sure if that's available (the only thing I know is the pipeline barrier so far, but I will look more). What would you advise me to do in this case? Edit: @Degerz was asking for our case, thanks! I was not aware of you asking this before you pinged me. |
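For reference, a minimal sketch of what this kind of programmable blending could look like with the requested extension; the binding names and the blend equation here are purely illustrative, not from any real project:

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Ordered, pixel-granularity critical section for this shader.
layout(pixel_interlock_ordered) in;

// The "framebuffer" is bound as a plain storage image instead of a
// color attachment, so the shader can both read and write it.
layout(binding = 0, rgba8) uniform coherent image2D uColorBuffer;

layout(location = 0) in vec4 vSrcColor;

void main() {
    ivec2 coord = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    // The read-modify-write is race-free and primitive-ordered
    // inside the interlock, so any blend equation can go here.
    vec4 dst = imageLoad(uColorBuffer, coord);
    vec4 blended = vec4(mix(dst.rgb, vSrcColor.rgb, vSrcColor.a), 1.0);
    imageStore(uColorBuffer, coord, blended);
    endInvocationInterlockARB();
}
```

Unlike OpenGL's texture barrier, this stays correct even for self-overlapping geometry within a single draw, which is the emulation requirement being discussed here.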
@jinjianrong Thank you very much for the response! Support for interlocks/ROVs isn't that compelling in D3D because engine developers are more interested in targeting higher hardware compatibility than in using the latest features, like I mentioned. By comparison, there are already open-source desktop(!) OpenGL applications out there that are using interlocks or framebuffer fetch, and we would like to be able to target both Windows and Linux on AMD's Vulkan drivers. Also, we don't want to implement order independent transparency with Vulkan subpasses. We want to have the same capability to do programmable blending for emulation purposes, and for this reason alone Vulkan subpasses are not a powerful enough mechanism, since they possibly(?) can't handle self-intersecting draws like we can with texture barrier. I understand from the hardware people's point of view that primitive ordered pixel shaders place an ordering constraint on executing fragment shaders, and that certainly has undesirable effects in terms of increased latency due to the stalling it causes. This feature helps us to emulate systems that have non-standard fixed function blending pipelines, as well as systems that are capable of programmable blending via shader framebuffer fetch. The biggest reason why interlocks/ROVs/fetch have an advantage over Vulkan subpasses is that the latter do not cover the edge case of self-intersecting geometry, and thus in our case give incorrectly rendered content! Edit: If I had to rate the severity of the lack of this feature, it would be almost as bad as not having transform feedback/stream-output available for DXVK, and your team added support for that fundamentally hardware-unfriendly feature as well, so just like with transform feedback we need to also cover some more cases with interlocks, even if it does have undesirable performance characteristics. |
As another data point as the author of PPSSPP, the popular Sony PSP emulator, the PSP also has a few blend modes that cannot be replicated without fragment shader interlock or similar programmable blending functionality. Now, games don't actually use them much and don't generally use them for self-overlapping draws, so framebuffer copies work in practice to emulate them, but for fully hardware-accurate emulation this would be useful. I'm not directly involved with Xenia, have only followed its development from the side, but it needs this functionality to simulate some framebuffer formats that only exist on the Xbox 360 and are heavily used by games. They're not practically feasible to emulate in other ways, |
@jinjianrong Xenia needs this for pretty much everything in render target emulation:
|
The unordered variant of this extension is essential for voxelization/global illumination/volumetric rendering. Otherwise what option is there for avg or max blending voxels other than clunky atomiccompswap spinlocks? Couldn't we at least have the unordered variant? Even if there were no use case why can't developers just have another tool in the tool box for coming up with new algorithms? The extension is also requested here, please come over and register your interest. |
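For context, the spinlock workaround being referred to usually looks something like the sketch below (binding names are made up); its divergence-heavy retry loop is exactly what makes it clunky and fragile on real hardware, and an unordered interlock would replace the whole loop with a plain begin/end pair:

```glsl
#version 450

layout(binding = 0, r32ui) uniform coherent volatile uimage3D uVoxelLock;  // 0 = free, 1 = held
layout(binding = 1, rgba16f) uniform coherent volatile image3D uVoxelData;

// Max-blend a value into a voxel with no interlock support available.
void voxelMaxBlend(ivec3 voxel, vec3 value) {
    bool done = false;
    while (!done) {
        // Try to take the per-voxel lock.
        if (imageAtomicCompSwap(uVoxelLock, voxel, 0u, 1u) == 0u) {
            vec4 prev = imageLoad(uVoxelData, voxel);
            imageStore(uVoxelData, voxel, vec4(max(prev.rgb, value), 1.0));
            imageAtomicExchange(uVoxelLock, voxel, 0u);  // release
            done = true;
        }
        // Lanes that lose the race spin here, diverging from the rest
        // of the wave; on some drivers such loops can even livelock.
    }
}
```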
@jinjianrong I'm sure you've realized it by now, but our stance on this issue as a community is non-negotiable, so we do not desire to seek the 'alternatives' that you speak of. I understand the anxiety your team is facing, since they're being asked to expose an unfriendly hardware feature. If you absolutely cannot have the general public accessing this feature, then I have a solution: apply whitelists so that these community projects specifically are able to access this feature in the driver. Is whitelisting certain applications a viable solution at your end for our case? |
@Degerz Whitelisting will make adding this to new projects impossible, it would never exist in Xenia if we had to go through any procedure of being added to a whitelist (and ROV usage there began as an experiment anyway), and that's the opposite of how PC gaming works. |
@Triang3l Then what other solutions do you suggest to an unwilling driver team? If new projects just pop up, then they should arguably just file an appeal, since AMD does not like the way applications could potentially use this feature. |
Except for Total War: Three Kingdoms The funny thing is that we're porting a game to ChromeOS to run as an Android application, inside the ANGLE sandbox prison, and even GLES 3.1 implemented by ANGLE reports And seems AMD is a special boy. |
We could add this into AMD's Mesa GL driver, or we would accept a 3rd-party contribution adding this feature there, at least to the extent of what DX supports. |
What's happening on RDNA 3 with POPS, by the way, with |
In hindsight, that would be better than nothing. |
@ryao I'm currently researching this (being an ISV and a wannabe contributor (froghacker specifically 🐸), not an AMD engineer), most unanswered questions currently are on the register setup side and things like potentially needed implicit barriers between changes related to multisampling and VRS, though I can't promise anything. My current plan on the shader side (there on GCN5/RDNA/RDNA2 it consists of two parts — overlapped wave awaiting, and then a loop running the critical section code for each overlap layer within the current wave, effectively splitting a part of the shader into smaller "subgroups") so far is:
|
On GCN 5, I'm getting horrible hangs if

Update: the issue is at least partially in the sample itself.

Update 2: this is intended behavior, MSAA causes adjacent polygons covering the same pixel at the common edge to overlap each with |
Since it's Halloween, I need to say that the way it's not the right formulation, especially when it comes to how operations aware of which lanes are currently active (like ballots) interact with the intrawave collision loop (specifically, intrawave collisions result in a part of the shader within one wave being executed first for overlapped lanes, then for overlapping lanes, then for other overlapping lanes, and so on — each time with a narrower set of active lanes than outside the CS), is extremely SCARY and spooky 🙀😿 For example:

```glsl
uint64_t before = ballotARB(true);
// let's say `before` is …0000111111111111
beginInvocationInterlockARB();
uint64_t during = ballotARB(true);
// first iteration: `during` = …0000000000001111
// second iteration: `during` = …0000111111110000
endInvocationInterlockARB();
uint64_t after = ballotARB(true);
// `after` is …0000111111111111 again
```

This is clearly not right from the point of view of GLSL and SPIR-V (or in an even more horrifying example, from the point of view of ROV loads/stores in HLSL) — there are no control flow constructs (conditionals, loops, returns) in the shader code that would suggest that

Yesterday I was thinking how they can be handled more or less safely, but it looks like for that, it would be necessary to locate all dependency chains of every ballot that cross the boundaries of the critical section, and include them in the CS. However, this has at least two issues. One inconvenience is that by including a dependency chain of a ballot crossing the CS boundaries into the CS, you're expanding the CS, thus changing its boundaries — and some ballots that could stay outside previously now may have to be moved into it. Though this can be solved by just running this pass again and again until it makes no more changes. But a more severe problem, that I already explained in my previous message, but that I have to highlight the painfulness of, is that obtaining the dependency chain of something is highly non-trivial when variables (including dynamically indexed arrays), or what's even worse, global memory, are involved. Basically, if you're writing into a buffer or an image, potentially any buffer/image load (with

One potential solution that I thought about was forcing all ballots to be inside the critical section. But there's an obvious flaw in it, so I'm of course not going to use it — that would outright ruin the last ballot in the shader that was outside the CS in the original code. Specifically, that would change:

```glsl
criticalSection {
    // ROV accesses here
}
uint64_t lanesRemaining = ballotARB(true);
while (lanesRemaining) {
    // some non-uniform resource access scalarized manually here
}
```

into:

```glsl
uint64_t lanesRemaining;
criticalSection {
    // ROV accesses here
    lanesRemaining = ballotARB(true);
}
while (lanesRemaining) {
    // some non-uniform resource access scalarized manually here
}
```

thus I'm really not sure that I want to spend a huge amount of time trying to untangle all this mess, so at least at first I'll probably just leave a // FIXME for now. While this would of course result in behavior that makes no sense from the SPIR-V or GLSL point of view, at least the change of the set of active lanes would happen in locations that are predictable and can be taken into account. Specifically, they will be the

Of course we could use more radical solutions, such as going Intel's |
@jpark37 While this was very long ago, if you still remember the details of your tests, could you please provide the settings you had in the Intel OIT sample? Most importantly, was MSAA used in your test run on AMD, and what exact algorithm was used without ROV in your testing setup (if any OIT at all)? MSAA specifically has a massive performance hit with ROV on AMD due to adjacent primitives overlapping each other as I found out two comments above, but that applies only to the PixelInterlock modes. Without MSAA, or with MSAA in the SampleInterlock mode, in Nvidia's OIT sample, in a spinlock -> interlock comparison, I was getting a ratio similar to your Intel and Nvidia results (on the RX Vega 10, without MSAA, 22ms > 26ms if I recall correctly). MSAA with PixelInterlock, on the other hand, was closer to what you were getting on AMD, though even worse — a 15x-ish increase. However, the spinlock is also an approach that's very hostile to parallelism, so maybe the spinlock was just slow in the first place, and the interlock turned out to be just slightly slower. Though I'll probably also try running it by myself when I finish other tasks. I also wonder, by the way, since it's a D3D11 sample, whether implicit UAV barriers might have caused a significant drop, or was ROV actually the bottleneck there. |
Sorry, I don't.
This much is unlikely though because I'll always disable MSAA if the option is in front of me. |
3 AAA games use it, and AMD has this feature on DX12. Also, we plan to use it to do CSG in a single pass for a CAD app; at this rate we'll open a popup saying "Buy a real GPU" and open a browser with Amazon and eBay searches for Nvidia and Intel when we detect an AMD GPU. |
Yes, and I did more research recently for my future blog post — and the deterministic ordering, and thus the lack of temporal noise, in overflow handling makes fragment shader interlock a much more reliable solution even to order-independent transparency compared to other two-pass methods like with a spinlock. This was also cited in the GRID 2 article (for them even 2 nodes per pixel were enough for order-independent transparency, and 4 nodes for Adaptive Volumetric Shadow Maps for smoke lighting), in the MLAB benchmark. Without fragment shader interlock, no matter how advanced your tail blending algorithm is, it will always be broken anyway — because you'll just have incoherent noise if any overflow happens. Like an analog TV with no antenna connected, on your trees or in your glass panes. Additionally, with fragment shader interlock you can do OIT partially, coarsely sorting large batches of geometry (level map tiles, objects, meshlets), and doing fine OIT inside those batches and between nearby batches, including to handle intersecting polygons (which are very common in foliage). And if you sort batches by the farthest depth in them (conservatively is enough), with fragment shader interlock, you can compare the sort key of the current batch with the closest OIT fragment depth in the pixel so far — and if it turns out to be closer, with fragment shader interlock, you can just safely resolve OIT as soon as that happens and free all your OIT layers for reuse (without causing any pipeline stalls for pixels that don't need OIT, unlike an explicit resolve pass, even if stencil-masked, with pipeline barriers — which also wouldn't work with instancing or mesh shading) as you know that all new fragments will be closer from that moment on. This can effectively provide you infinite layers in the view, with a small number of layers within object "clusters" needed in the RAM. 
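The bounded per-pixel layer bookkeeping described above is only safe because the interlock makes the per-pixel read-modify-write atomic and deterministic in primitive order. A simplified sketch (the layer count, storage layout, and names are arbitrary, and alpha/transmittance handling is omitted for brevity):

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;

const int kLayers = 4;
// One array slice per OIT layer; rgb = premultiplied color, a = depth.
// Cleared to depth = +inf before the transparency pass.
layout(binding = 0, rgba32f) uniform coherent image2DArray uOIT;

void insertLayer(vec3 premulColor, float depth) {
    ivec2 p = ivec2(gl_FragCoord.xy);
    vec4 frag = vec4(premulColor, depth);
    beginInvocationInterlockARB();
    // Insertion sort by depth; on overflow the farthest fragment falls
    // off the end deterministically, so there is no temporal noise.
    for (int i = 0; i < kLayers; ++i) {
        vec4 stored = imageLoad(uOIT, ivec3(p, i));
        if (frag.a < stored.a) {
            imageStore(uOIT, ivec3(p, i), frag);
            frag = stored;
        }
    }
    endInvocationInterlockARB();
}
```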
Other uses that come to mind are deferred decals — blending into the normal G-buffer, as demonstrated by Nvidia (especially useful for decals on curved surfaces); or drawing huge numbers of sorted particles with a custom blending equation (like Hard Light for both lightening and darkening), as well as with per-particle blending equation selection (especially useful with bindless textures — to have all the additive fireworks and all the alpha-blended smoke in a single draw command with correct ordering between each other — and you can't just put fireworks in one draw command and smoke in another, as you wouldn't be able to mix ordering of the particles between the two). On the implementation side, by the way, I'm somewhat worried about the changes to POPS setup introduced by GFX11. Specifically, the POPS_OVERLAP_NUM_SAMPLES setting is now gone. Currently it's difficult for me to allocate the money to purchase a testing device, so I can't check this by myself. But can someone (@Anteru possibly?) please confirm, how does POPS behave on RDNA 3 with sample-rate shading? Direct3D and Metal (and Intel's old extension) only require sample-granularity interlocking with sample-frequency shading. However, Vulkan and OpenGL fragment shader interlock gives explicit control of the interlock scope to the shader via its execution mode — so it's still possible to request pixel-level interlock, which offers wider guarantees, in a sample-rate shader (like via POPS_OVERLAP_NUM_SAMPLES = 0 on Vega/RDNA/RDNA2). And if the device supports the |
You briefly mentioned custom blending equations, and many have also mentioned wanting this extension for programmable blending, it's possible to use feedback loops for that purpose https://registry.khronos.org/vulkan/specs/1.2-extensions/html/vkspec.html#renderpass-feedbackloop Mesa zink uses this to implement fbfetch (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12603), which Mesa uses to support the PCSX2 uses it to support non-standard blend modes. This requires splitting render passes and using barriers (with VK_DEPENDENCY_BY_REGION) to ensure sync. Performance is reasonable on Nvidia (even with a very large number of draws), however, AMD doesn't fully support VK_DEPENDENCY_BY_REGION and has worse performance as a result, see this Reddit post & AMD Community post, which seems to be a hardware limitation. I don't know if OIT can be supported by the above approach. and I don't know how performance compares to using shader interlocks w.r.t. custom blending. |
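For completeness, the shader side of that feedback-loop approach reads the attachment back as a subpass input. A sketch (binding numbers and the blend equation are arbitrary):

```glsl
#version 450

// The color attachment is simultaneously bound as an input attachment,
// made legal by the feedback-loop rules (or a subpass self-dependency
// with VK_DEPENDENCY_BY_REGION_BIT).
layout(input_attachment_index = 0, set = 0, binding = 0) uniform subpassInput uLastColor;

layout(location = 0) in vec4 vSrcColor;
layout(location = 0) out vec4 oColor;

void main() {
    vec4 dst = subpassLoad(uLastColor);
    // A non-standard blend mode that fixed-function blending can't express:
    oColor = vec4(abs(vSrcColor.rgb - dst.rgb), vSrcColor.a);
}
```

The limitation mentioned above applies: each pixel may only go through one write-then-read cycle between barriers, which is why overlapping draws force render-pass splits.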
@TheLastRar Fragment shader interlock and shader framebuffer fetch (not an explicit barrier that causes excessive synchronization — especially on AMD which doesn't have BY_REGION as far as I know, but even on mobile tiled GPUs what's basically expected from BY_REGION at worst is simply not flushing/reloading tile memory, making the barrier just tile-local rather than global — and doesn't support draw commands with overlap, let alone intersecting primitives; but Arm's ordered version instead) both can be used to implement programmable blending, however, they're quite different in the details, so it would be ideal if hardware supported both, but with Intel being the only PC graphics card developer supporting the latter currently (while all the biggest 3 have fragment shader interlock in their hardware and at least some of their drivers), FSI is effectively the only option on the PC now. But in general, fragment shader interlock offers massively more flexibility than shader framebuffer fetch. The only advantage of SFBF I can imagine is that it supports late depth/stencil test, so it works directly with things like alphatested surfaces. With fragment shader interlock, the write happens in the shader, so your only choices are early depth/stencil (with post-depth coverage with MSAA), which only works for opaque surfaces not modifying the depth or the stencil reference from the shader, or full-blown software depth testing. However, FSI, being a shader part rather than an output-merger one, allows for arbitrary addressing, and that removes lots of limitations:
Also I'm not entirely sure about the requirements of Arm's rasterizer-ordered attachment access (maybe subpassInputMS is allowed, but I don't know for sure), but compared to OpenGL ES SFBF, FSI has one advantage for MSAA — you still can use pixel-frequency shading with FSI, and access per-sample data based on the input coverage mask with sample interlock, or per-pixel data as usual with pixel interlock. The OpenGL ES SFBF specification says: "Reading the value of gl_LastFragData produces a different result for each sample. This implies that all or part of the shader be run once for each sample…", but in reality as far as I know it's always full fallback to sample-rate shading, which cancels out the idea of MSAA on the performance side. Note that with FSI, you can still take advantage of texture tiling (if I understand correctly, it's even the same 64KB_R_X for both framebuffer attachments and storage images on RDNA and RDNA 2 normally), and as far as I know, modern AMD GPUs support internal compression for storage images as well. But again, what's the most important is that there's no SFBF anywhere on the PC except for Intel GPUs (and maybe Innosilicon, Moore Threads, though I don't know the details about them). |
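The MSAA advantage described above, pixel-frequency shading with per-sample data access, looks roughly like this (a sketch with assumed bindings, hardcoded for 4x MSAA):

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require
#extension GL_EXT_post_depth_coverage : require

// Sample-granularity interlock: fragments touching disjoint sample
// sets of the same pixel may still run concurrently.
layout(sample_interlock_ordered) in;
layout(post_depth_coverage) in;

// Per-sample color data as a storage image array (one slice per sample).
layout(binding = 0, rgba8) uniform coherent image2DArray uSamples;

layout(location = 0) in vec4 vSrcColor;

void main() {
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    // The shader runs once per pixel, but updates only covered samples.
    for (int s = 0; s < 4; ++s) {
        if ((gl_SampleMaskIn[0] & (1 << s)) != 0) {
            vec4 dst = imageLoad(uSamples, ivec3(p, s));
            imageStore(uSamples, ivec3(p, s),
                       vec4(mix(dst.rgb, vSrcColor.rgb, vSrcColor.a), 1.0));
        }
    }
    endInvocationInterlockARB();
}
```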
Another fun use for ROV is rendering to and blending non-renderable formats like RGB9E5 or some custom stuff. |
IIRC unless you have the brand new Vulkan ARM/EXT extension meant to replace OpenGL ES Pixel Local Storage or FramebufferFetch.... subpass feedback loops are limited to only a single pixel overwrite cycle, then you need a barrier (can be by-region). PLS + PSI are best of both worlds, because you can use the local framebuffer/tiler memory to store your MLAB4 buckets and not a |
I guess this is not an April Fools' Day joke? 😜 Even though this situation is complete tragicomedy and farce 🤷♂️ |
Fantastic news, thank you for your work on this! |
What a chad 💪😎 |
It just got merged into Mesa RADV devel |
Congratulations, @Triang3l, and well done. AMD users are indebted....only appreciation and admiration for this awesome contribution. :) |
Brilliant! The FOSS community has done it again 🎉 |
Hello AMD Vulkan developers! Has your stance on supporting this extension changed within the last 4 years? There seems to be plenty of examples given here on how this would affect the user experience on any AMD card using the Windows proprietary drivers, so I would really like to know your current (updated) thoughts on this matter. Best regards, Moonlacer |
VK_EXT_fragment_shader_interlock has been added to amdvlk in 194a181da7e2cca5f70ec0f9e65119955b3d2b47 |
That's not Windows though, is it? |
It's the same driver, only the compiler may differ.
|
The same commit also added |
That commit doesn't implement it, only makes it "known" to the extension management in the driver, and the newly added device feature query reports that all interlock features are unsupported. But it's an awesome sign that it's soon™! Unfortunately no new AMDVLK will be released for the Vega generation though. |
VK_EXT_fragment_shader_interlock is added as of Adrenalin 24.12.1 |
According to the Vega ISA documentation, this feature uses the SOPP scalar microcode format. Currently, this feature is only exposed in AMD's D3D12 drivers as "rasterizer ordered views", so I'd like to see a mirror equivalent supported in Vulkan as well, known as `VK_EXT_fragment_shader_interlock`. We need this feature to emulate a certain PowerVR GPU for our use case, and particularly we want the `fragmentShaderPixelInterlock` feature exposed from the extension, so can your team enable this for the Vulkan drivers? (Bonus if the team can also get `fragmentShaderSampleInterlock` exposed too.) Also, if you are working on this extension, can we get an estimate of when a driver will be released to support the extension/feature?