
Expose primitive ordered pixel shaders #108

Closed · Degerz opened this issue Aug 16, 2019 · 154 comments


Degerz commented Aug 16, 2019

According to the Vega ISA documentation, this feature uses the SOPP scalar microcode format. Currently, it is only exposed in AMD's D3D12 drivers as "rasterizer ordered views", so I'd like to see the equivalent, known as VK_EXT_fragment_shader_interlock, supported in Vulkan as well.

We need this feature to emulate a certain PowerVR GPU for our use case, and in particular we want the fragmentShaderPixelInterlock feature from the extension, so can your team enable this in the Vulkan drivers? (Bonus points if the team can also get fragmentShaderSampleInterlock exposed.) Also, if you are working on this extension, can we get an estimate of when a driver supporting it will be released?


Degerz commented Aug 23, 2019

Can I get a response from the team?

@jinjianrong (Member)

Our stance is that we don't want to implement it. It messes with subpasses and is not the right formulation.


Degerz commented Aug 23, 2019

What exactly do you mean by "not the right formulation"? Is this extension somehow the wrong abstraction to map to this feature inside the hardware?

If so, is there a better way to expose it, like a potential framebuffer fetch extension? (I don't think AMD HW supports it with multiple render targets, so things could get sketchy.)

How is the interlock extension different from your "primitive ordered pixel shaders"? We badly need this extension, or something similar such as framebuffer fetch.


Degerz commented Aug 23, 2019

How would you like to proceed with this?

Can we at least get a vendor extension from AMD exposing this directly, if your team doesn't like how the interlocks are specified? Or would you prefer to close this issue if you have no intention of exposing similar functionality in Vulkan?

I am requesting this because there's arguably a stronger case for exposing ROV-like functionality in Vulkan than in D3D12: there is higher interest from open-source projects in using it than from AAA game engine developers.


oscarbg commented Aug 24, 2019

@Degerz @jinjianrong It's sad to see that AMD has no plans to support this extension in its Vulkan driver, even now that VK_EXT_fragment_shader_interlock is a de facto "standard" by virtue of being supported by all other vendors (NV and Intel) on all other OSes (Windows and Linux).
On Windows it is supported on "recent" NV (>= Maxwell) and Intel (>= Gen9 Skylake) GPUs:
https://vulkan.gpuinfo.org/listdevices.php?platform=windows&extension=VK_EXT_fragment_shader_interlock
Similarly, on Linux it is supported on NV and Intel:
https://vulkan.gpuinfo.org/listdevices.php?platform=linux&extension=VK_EXT_fragment_shader_interlock

Heck, even Metal 2.0 on macOS supports the exact same feature: on the Vega cards in the iMac Pro we get "RasterOrderGroupsSupported".

Adding more use cases:
It should be useful for the VKD3D project to support D3D12 "rasterizer ordered views", in case any D3D12 games use them.

In fact, the Xenia emulator's D3D12 backend uses the ROV feature for better emulation of the Xbox EDRAM hardware. Xenia is also working towards adding Linux support:
xenia-project/xenia#1430
and it has a less mature Vulkan backend that the Linux port would use.
Xenia's Vulkan backend could take advantage of VK_EXT_fragment_shader_interlock for better/faster emulation of the Xbox hardware,
so I'm adding @Triang3l to the discussion in case he wants to discuss further.

EDIT:
It could even be supported in the MoltenVK Vulkan driver on macOS, so I asked for it:
KhronosGroup/MoltenVK#630


Degerz commented Aug 24, 2019

@oscarbg Good idea to get more people interested in this functionality; I think I'll do the same! While you're at it, can you also ask other AMD engineers like @Anteru on Twitter, to show that the community wants this functionality on AMD HW in Vulkan too.

cc @tadanokojin @hrydgard @pent0

The above have actively expressed interest in, and/or are already using, functionality similar to shader interlock in their projects. One of their main motivations for using Vulkan is getting access to modern GPU features like interlock, so we'd prefer not to have to move over to platform-specific APIs like Metal or D3D12 to use this feature!

Vulkan subpasses are possibly not powerful enough for their purposes. I don't care if AMD never exposes VK_EXT_fragment_shader_interlock, but please at least give them another viable alternative, even if it is an AMD-specific extension!


pent0 commented Aug 24, 2019

I just want to do programmable blending. If you can provide other primitives that would also be OK, but this is the best. Texture barrier (for OpenGL) is what I am using, but it's not really the fastest path (the same goes for Vulkan, if applicable). I don't really know how you would do it, though.


ryao commented Aug 24, 2019

How would you like to proceed with this?

Here is my suggestion. Use the extension and tell users to switch to either Intel or Nvidia graphics hardware because AMD refuses to support the extension and cite this issue. Watch AMD backpedal on this very quickly after an executive hears about the situation.

Also, ask the RADV developers to implement support so that Windows users who want to use software that depends on it have the option of switching to Linux for it.

@Triang3l

From the passes point of view, what's different in this from regular image/buffer stores?


Degerz commented Aug 24, 2019

@ryao It seems unlikely that this would reach the upper echelons of the company, and I don't know if the Mesa developers are all that interested, since I haven't seen any patches related to this issue ...

@Triang3l Here are some insights from Sascha. Along with the stated limitations, I do not think that Vulkan subpasses are capable of handling self-intersecting draws the way OpenGL's texture barrier is.

@RussianNeuroMancer

Seems like an unlikely scenario that it would reach to the very high echelons in the company

A news article on Phoronix could help with this a bit.


ryao commented Aug 24, 2019

@ryao Seems like an unlikely scenario that it would reach to the very high echelons in the company and I don't know if mesa developers are all that interested since I haven't seen any patches related to this issue ...

All you need is for end users to start telling each other that AMD graphics hardware is not friendly to emulators, after they start asking why it doesn't work on AMD graphics hardware. It will reach the upper echelons when they are trying to figure out why they did not meet their sales projections.

As for the Mesa developers, they might not know that this extension has any use cases. I was under the impression that those working on RADV were volunteers, so if you don't ask them about it, they seem less likely to implement it.


Degerz commented Aug 24, 2019

@ryao TBH, I feel it is more constructive for developers like @pent0 to simply express their desire for this feature and list their use cases instead ...

At the end of the day, advanced system emulation doesn't account for even a fraction of AMD's customers, and the emulation community is already aware that AMD has a checkered history with them, so the leading hardware vendor is already favoured there.

I'd prefer to show that their driver manager's position is out of touch with the community's position, because with higher-ups such as executives there's no guarantee that they'd understand this issue or be specialists in GPU drivers who could help us out.


pent0 commented Aug 24, 2019

I love you guys. I know you can do it, however hard it is. Go go go! We all want this feature.

Also, programmable blending is not only something emulators want; it's also what many game developers desire, to achieve nice and godly effects in their games that fixed blending cannot do. I can't give an example on PC, but here is one done with Metal on iOS.

"The programmable shader pipeline has been here for 15 years, so programmable blending should be too. It's the de facto standard nowadays." (I copied this quote from that article.)

I am really bad at wording, hehe; I'm just expressing what many people want. Please reconsider :)


jarrard commented Aug 25, 2019

Who exactly does this affect anyway? Just PowerVR GPU users? If so, I can understand why AMD doesn't want to dedicate valuable development time to this endeavour. Nothing is stopping the community from adding it themselves, thanks to OPEN-SOURCE drivers!


jarrard commented Aug 25, 2019

Who exactly does this affect anyway? just PowerVR GPU users?

In relation to the topic and the example given! How is this not obvious?

PowerVR GPU

Then that's not the best example; it would have been better to give examples that are not fringe cases but more common use.
Also, referring to people as retarded is why stuff like this gets ignored; it's quite an anti-open-source attitude to have, and it only derails things.

Take a chill pill mate!
THE END


pent0 commented Aug 25, 2019

It's not only relevant to emulation; it's related to everything. DXVK needs interlock to implement ROVs, and game developers need it to do programmable blending.

It affects many things, hence this extension exists. Please think about it more.


jfdhuiz commented Aug 25, 2019

@pent0 I get that you are angry. If you feel misunderstood, express that feeling. Make your points and cut out the strong language. Strong language won't help your cause (for innocent bystanders it looks like you're grasping at straws), and it is disrespectful. Your point is much, much stronger without the strong language.


pent0 commented Aug 25, 2019 via email

@jinjianrong (Member)

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

Additionally, this is an inefficient method of performing the typical thing it's often advocated for - order independent transparency. For such an effect we would usually recommend using one of the many two-pass OIT algorithms out there, and making use of multiple subpasses, with the second pass doing the resolve. This is likely the most portably efficient mechanism you can use that works between desktop and mobile parts. We're thus not inclined to support it, as we'd rather not promote an inefficient technology.

However, if you're looking to do direct emulation, we are not sure that really helps you - perhaps you could elaborate on what it is you're trying to emulate exactly and we may be able to advise on an alternative method?
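To make the recommended approach concrete: in a two-pass OIT scheme, the first pass appends per-pixel fragment lists, and the second pass sorts and composites them, so no cross-fragment ordering guarantee is needed during rasterization. A hedged, CPU-side Python sketch of the resolve step (all names are illustrative, not from any real driver or sample):

```python
# CPU-side sketch of the resolve pass of two-pass order-independent
# transparency: fragments collected per pixel in pass 1 are sorted
# far-to-near and composited over the background. Illustrative only.

def resolve_pixel(fragments, background):
    """fragments: list of (depth, color, alpha); larger depth = farther."""
    result = background
    for depth, color, alpha in sorted(fragments, key=lambda f: -f[0]):
        result = alpha * color + (1.0 - alpha) * result  # "over" blend
    return result

# Two transparent fragments over a black background; the submission
# order does not matter because the resolve sorts by depth.
frags = [(0.5, 1.0, 0.5), (0.2, 0.0, 0.5)]
assert resolve_pixel(frags, 0.0) == resolve_pixel(list(reversed(frags)), 0.0)
```

Because the resolve is order-invariant, the rasterizer never needs a primitive-ordered critical section, which is the efficiency argument being made here.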


ghost commented Aug 25, 2019

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

It's being used in Just Cause 3 and GRID 2.
https://software.intel.com/en-us/articles/optimizations-enhance-just-cause-3-on-systems-with-intel-iris-graphics

https://software.intel.com/en-us/articles/oit-approximation-with-pixel-synchronization

@illusion0001

Quote from @oscarbg:
"Xenia (Xbox 360 emulator) D3D12 backend uses the ROV feature for better emulation of Xbox EDRAM hardware."


pent0 commented Aug 25, 2019

I don't know much about this stuff, so I will let the people who know discuss it. I will try to find a workaround for now.

Hi, in our case we are trying to emulate a feature of the PowerVR GPU: you can fetch the last fragment data of a texel in the color buffer and use it for blending inside the fragment shader. It's like blending, but not fixed; it happens inside the shader (it's programmable).

For OpenGL on AMD we are using texture barrier. On Vulkan I'm not sure if an equivalent is available (the only thing I know of so far is the pipeline barrier, but I will look more). What would you advise me to do in this case?

Edit: @Degerz was asking about our case, thanks! I was not aware you had asked this before you pinged me.
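For readers unfamiliar with the PowerVR-style framebuffer fetch being described, here is a rough CPU-side Python sketch of the idea (purely illustrative; the names are made up): each fragment reads the current destination color and applies an arbitrary blend function that fixed-function hardware cannot express.

```python
# CPU-side sketch of "framebuffer fetch" style programmable blending:
# the fragment shader reads the last written color at its pixel and
# computes any blend function it likes. Hypothetical, illustrative only.

def draw_fragment(framebuffer, x, blend_fn, src):
    dst = framebuffer[x]              # the "fetch" of the last written color
    framebuffer[x] = blend_fn(src, dst)

fb = [0.25]
# e.g. absolute-difference "blending", which is not expressible as
# src*fs + dst*fd with fixed-function blend factors
draw_fragment(fb, 0, lambda s, d: abs(s - d), 0.75)
assert fb[0] == 0.5
```

The hardware question in this thread is precisely how to make that read-modify-write safe when multiple in-flight fragments target the same pixel.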


Degerz commented Aug 25, 2019

@jinjianrong Thank you very much for the response!

Support for interlocks/ROVs isn't that compelling in D3D because engine developers are more interested in targeting broad hardware compatibility than in using the latest features, as I mentioned. By comparison, there are already open-source desktop(!) OpenGL applications out there using interlocks or framebuffer fetch, and we would like to be able to target both Windows and Linux on AMD's Vulkan drivers.

Also, we don't want to implement order-independent transparency with Vulkan subpasses. We want the same capability to do programmable blending for emulation purposes, and for this reason alone Vulkan subpasses are not a powerful enough mechanism, since they possibly(?) can't handle self-intersecting draws the way texture barrier does. I understand from the hardware side that primitive ordered pixel shaders place an ordering constraint on fragment shader execution, and that certainly has undesirable effects in terms of increased latency due to the stalling it causes.

This feature helps us emulate systems that have non-standard fixed-function blending pipelines, as well as systems capable of programmable blending via shader framebuffer fetch. The biggest advantage interlocks/ROVs/fetch have over Vulkan subpasses is that the latter do not cover the edge case of self-intersecting geometry and thus, in our case, produce incorrectly rendered content!

Edit: If I had to rate the severity of lacking this feature, it would be almost as bad as not having transform feedback/stream-output available for DXVK, and your team added support for that fundamentally hardware-unfriendly feature as well. So, just as with transform feedback, we need interlocks to cover some more cases, even if they have undesirable performance characteristics.
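The self-intersecting-geometry problem comes down to read-modify-write blending being order-dependent: if two fragments of the same draw hit one pixel, the result depends on which critical section runs first, so the hardware must honor primitive order to be deterministic. A tiny CPU-side Python illustration (hypothetical, not emulator code):

```python
# Why self-overlapping geometry needs an ordering guarantee: the
# "over" blend is not commutative, so two fragments of one draw
# covering the same pixel produce different results depending on
# which one executes its read-modify-write first. Illustrative only.

def blend_over(src, src_alpha, dst):
    return src * src_alpha + dst * (1.0 - src_alpha)

dst = 0.0
in_order     = blend_over(0.2, 0.5, blend_over(0.8, 0.5, dst))
out_of_order = blend_over(0.8, 0.5, blend_over(0.2, 0.5, dst))
assert in_order != out_of_order  # execution order changes the pixel
```

An interlock (or ROV) guarantees the in-order result; an unordered image store does not, which is exactly the rendering error being described.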

@hrydgard

As another data point, as the author of PPSSPP, the popular Sony PSP emulator: the PSP also has a few blend modes that cannot be replicated without fragment shader interlock or similar programmable blending functionality. Now, games don't actually use them much, and generally don't use them for self-overlapping draws, so framebuffer copies work in practice to emulate them, but for fully hardware-accurate emulation this would be useful.

I'm not directly involved with Xenia and have only followed its development from the side, but it needs this functionality to emulate some framebuffer formats that only exist on the Xbox 360 and are heavily used by games. They're not practically feasible to emulate in other ways.


Triang3l commented Aug 25, 2019

@jinjianrong Xenia needs this for pretty much everything in render target emulation:

  • Blending with piecewise linear gamma, with float7e3.7e3.7e3.unorm2 color format, with exponent-biased 16-bit snorm format (with -32 to 32 range).
  • Float20e4 depth (especially important when games do EDRAM–RAM–EDRAM round trips (GTA IV, Halo 4) and in this case 32-bit floats cannot be used as games reupload the depth buffer to the EDRAM by writing 24-bit depth to GBA of a R8G8B8A8 color render target aliasing the depth/stencil buffer totally destroying invariance). Of course it's not as fast without all the hi-Z, compression and true early Z, but it's more or less acceptable, and with copying to allow for aliasing we would still lose the first two.
  • Fast aliasing without copying (which is also inaccurate as there are some draws without viewport/scissor where we can't even determine the height of the render target — like drawing a DI_PT_RECTLIST to a single render target at RB_COLOR_INFO::EDRAM_BASE 0 (so nothing that would naturally truncate it) with a custom vertex shader and vfetch layout) — aliasing happens a lot even in cases like clears, which are usually done via a 4x MSAA depth buffer even for single-sampled color buffers.
  • MSAA via ForcedSampleCount (without that on GPUs without SV_StencilRef we can't restore the stencil buffer after aliasing and have to fall back to slow SSAA).

@gharland

The unordered variant of this extension is essential for voxelization, global illumination, and volumetric rendering. Otherwise, what option is there for average or max blending of voxels other than clunky atomicCompSwap spinlocks? Could we at least have the unordered variant? Even if there were no use case, why can't developers just have another tool in the toolbox for coming up with new algorithms?
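As a rough illustration of the spinlock workaround mentioned above, here is a CPU-side Python sketch of emulating an atomic max with a compare-and-swap retry loop, analogous to looping on imageAtomicCompSwap in a shader (all names hypothetical; a real GPU version also has to contend with lane divergence):

```python
# Emulating "max blending" into a voxel with a compare-and-swap loop,
# the clunky fallback when no interlock/ROV is available.
# CPU-side sketch with a toy CAS primitive. Illustrative only.

def compare_and_swap(mem, addr, expected, new):
    """Toy CAS: swap only if the current value matches, return the old value."""
    old = mem[addr]
    if old == expected:
        mem[addr] = new
    return old

def atomic_max(mem, addr, value):
    while True:
        old = mem[addr]
        if old >= value:
            return old                            # nothing to update
        if compare_and_swap(mem, addr, old, value) == old:
            return old                            # our swap won the race

voxels = {0: 0.25}
atomic_max(voxels, 0, 0.75)
atomic_max(voxels, 0, 0.5)
assert voxels[0] == 0.75
```

With an unordered interlock, the whole retry loop collapses into a plain read-compare-write inside the critical section.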

The extension has also been requested here; please come over and register your interest:

https://community.amd.com/message/2927066

https://community.amd.com/message/2926956


Degerz commented Aug 25, 2019

@jinjianrong I'm sure you've realized it by now, but our stance on this issue as a community is non-negotiable, so we do not desire the 'alternatives' that you speak of.

I understand the anxiety your team is facing about exposing a hardware-unfriendly feature, and if you absolutely cannot have the general public accessing it, then I have a solution: apply whitelists so that these community projects specifically can access the feature in the driver.

Is whitelisting certain applications a viable solution on your end for our case?


Triang3l commented Aug 25, 2019

@Degerz Whitelisting would make adding this to new projects impossible; it would never have existed in Xenia if we had had to go through any procedure of being added to a whitelist (and ROV usage there began as an experiment anyway), and that's the opposite of how PC gaming works.


Degerz commented Aug 25, 2019

@Triang3l Then what other solution do you suggest for an unwilling driver team?

If new projects pop up, they could arguably just file an appeal, since AMD does not like the ways applications could potentially use this feature.

@devshgraphicsprogramming

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

Additionally, this is an inefficient method of performing the typical thing it's often advocated for - order independent transparency. For such an effect we would usually recommend using one of the many two-pass OIT algorithms out there, and making use of multiple subpasses, with the second pass doing the resolve. This is likely the most portably efficient mechanism you can use that works between desktop and mobile parts. We're thus not inclined to support it, as we'd rather not promote an inefficient technology.

However, if you're looking to do direct emulation, we are not sure that really helps you - perhaps you could elaborate on what it is you're trying to emulate exactly and we may be able to advise on an alternative method?

Except for Total War: Three Kingdoms.

The funny thing is that we're porting a game to ChromeOS to run as an Android application, inside the ANGLE sandbox prison, and even GLES 3.1 as implemented by ANGLE reports NV_fragment_shader_interlock.

And it seems AMD is the special boy here.

@marekolsak

We could add this to AMD's Mesa GL driver, or we would accept a third-party contribution adding this feature there, at least to the extent of what DX supports.


Triang3l commented Oct 21, 2022

What's happening on RDNA 3 with POPS, by the way, with src_pops_exiting_wave_id and HW_REG_POPS_PACKER having been removed? Are you not scheduling overlapping wavefronts until the overlapped ones have completed execution now (like EXEC_IF_OVERLAPPED = 0 on the earlier architecture revisions if I understand correctly what it means), or something more interesting?


ryao commented Oct 21, 2022

We could add this to AMD's Mesa GL driver, or we would accept a third-party contribution adding this feature there, at least to the extent of what DX supports.

In hindsight, that would be better than nothing.


Triang3l commented Oct 21, 2022

@ryao I'm currently researching this (as an ISV and a wannabe contributor (froghacker specifically 🐸), not an AMD engineer). Most of the open questions are currently on the register setup side, and about things like potentially needed implicit barriers between state changes related to multisampling and VRS, though I can't promise anything.

My current plan on the shader side (where on GCN5/RDNA/RDNA2 the implementation consists of two parts: awaiting overlapped waves, and then a loop running the critical section code for each overlap layer within the current wave, effectively splitting a part of the shader into smaller "subgroups") so far is:

  1. Find the locations that dominate all begins and post-dominate all ends (taking advantage of SPV_EXT_fragment_shader_interlock's requirement that an end must be executed dynamically after a begin exactly once — thus even if the begin and the end are in different conditionals, it's still possible to estimate a conservative lower bound).
  2. Move the candidate critical section boundaries out of all outer loops, to handle cases when static domination doesn't imply dynamic precedence (like a loop with if (i == 0) { beginInvocationInterlock(); } else if (i == 1) { endInvocationInterlock(); }), and so that there are no breaks from the original shader that need to be routed from the new loop to the original outer loop.
  3. Find the rectangle in the control flow tree that spans all the beginning and the end points moved outside loops. Check if all ends don't precede all begins (in this case, treat an unclosed critical section like a critical section until the end of the shader for simplicity).
  4. Maybe narrow this region to memory accesses inside it as long as loops are not re-entered as a result. This is especially useful if all the memory accesses in the critical section are conditional, while the begin/end are not (per both the GLSL extension requirement statically and the SPIR-V extension requirement dynamically) — AMD GPUs support returning without entering the critical section according to the implementation of D3D ROVs. Note that coherent should have no effect on this — ordered writes without reads may still be done with FSI without coherent in the Vulkan memory model, I think, availability/visibility is probably not needed for write-only access, though I'm not yet sure how this maps to L1 cache usage on AMD specifically with POPS.
  5. Expand the critical section to include all lane-aware operations such as ballots, so that they're unaffected by the new loop narrowing the exec mask, which is not exposed in any way in the SPIR-V control flow and spans an arbitrary conservative region. Note that it's not always possible to isolate the dependency tree of some ballot because its result may go to memory (or to variables in a non-SSA form, including local arrays). This may not be skipped if there are no subgroup operations inside the critical section even though the purpose is just ensuring consistency of their behavior between the critical section (sequentially running smaller sub-sub-groups for each overlap layer) and the rest of the shader (running code for all the layers at the same time in the full original fragment shader subgroup) — exactly for the reason of the results having unclear dependencies (store a result of a full-wave ballot before the critical section, load it in the critical section on the same control flow level — it's not actual anymore even though from the point of view of SPIR-V it still should be).
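Step 1 of the plan above relies on standard dominator analysis: a conservative start of the critical section is a node that dominates every beginInvocationInterlock location. A minimal CPU-side Python sketch on a toy CFG (the CFG and names are hypothetical, not RADV code):

```python
# Toy dominator analysis for step 1: find a node that dominates every
# "begin" location in a small control flow graph. The symmetric
# post-dominator computation for the "ends" would reverse the edges.
# Hypothetical sketch, not driver code.

def dominators(cfg, entry):
    """Iterative dataflow: dom(n) = {n} union intersection of dom(preds)."""
    nodes = set(cfg)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            preds = [p for p in cfg if n in cfg[p]]
            new = {n} | set.intersection(*(dom[p] for p in preds))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# entry -> if -> (then | else) -> merge, with a begin in each branch
cfg = {"entry": ["if"], "if": ["then", "else"],
       "then": ["merge"], "else": ["merge"], "merge": []}
dom = dominators(cfg, "entry")
begins = {"then", "else"}
common = set.intersection(*(dom[b] for b in begins))
# "if" dominates both begins, so the CS can conservatively start there
assert "if" in common and "merge" not in common
```

In the real pass this would of course run on the SPIR-V/NIR control flow tree, with the extra loop-hoisting from step 2 applied on top.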


Triang3l commented Oct 22, 2022

On GCN 5, by the way, I'm getting horrible hangs if POPS_OVERLAP_NUM_SAMPLES doesn't match the (rasterizer?) sample count (though I can't find any setup code for it in PAL or XGL, which is pretty weird; neither can I for EXEC_IF_OVERLAPPED, for example), although I haven't checked with sample-rate shading yet. So it looks like the per-pixel mode is the only one supported there (though per-sample must still be exposed for compatibility with apps relying on it; more ordering is always okay: in an extreme example, a Vulkan implementation processing one quad at a time in order is technically valid too).

I'm not sure if the glitches I'm getting in Nvidia's order-independent transparency sample with the per-pixel mode are actually related to the interlock or to something else. The look of the edges is not stable in time, but the effect doesn't look anything like a race condition between waves (no rectangular patterns). It may be caused by the way I'm handling intrawave collisions in my prototype (not inserting memory barriers between iterations yet, for example), but I'm also getting some weirdness in the software spinlock mode of this sample, so I don't know.

On more recent hardware, primarily RDNA 2 with VRS, I don't know yet how the sample count should be set up at all, but the registers don't seem to imply any flexibility in the interlock modes. Apparently the hardware is designed only around the requirement of per-fine-pixel interlocking in D3D rasterizer-ordered views and GL_INTEL_fragment_shader_ordering, and possibly Metal raster order groups (though I haven't checked yet whether that's implemented as per-coarse-pixel on RDNA 2 or actually just per-fine-pixel).

Early test of a custom RADV POPS implementation on GCN 5

Update: the issue is at least partially in the sample itself.

Update 2: this is intended behavior. MSAA causes adjacent polygons covering the same pixel at a common edge to overlap each other with pixel_interlock, and thus performance naturally drops massively. However, POPS_OVERLAP_NUM_SAMPLES == MSAA_EXPOSED_SAMPLES seems to work perfectly for sample_interlock; this way POPS is pretty fast even with MSAA on GFX9, as long as you only need to access per-sample data, not per-pixel data.


Triang3l commented Nov 1, 2022

Since it's Halloween, I need to say that the way this is "not the right formulation", especially when it comes to how operations aware of which lanes are currently active (like ballots) interact with the intrawave collision loop, is extremely SCARY and spooky 🙀😿 Specifically, intrawave collisions result in a part of the shader within one wave being executed first for overlapped lanes, then for overlapping lanes, then for other overlapping lanes, and so on, each time with a narrower set of active lanes than outside the CS. For example:

uint64_t before = ballotARB(true);
// let's say `before` is …0000111111111111
beginInvocationInterlockARB();
uint64_t during = ballotARB(true);
// first iteration: `during` = …0000000000001111
// second iteration: `during` = …0000111111110000
endInvocationInterlockARB();
uint64_t after = ballotARB(true);
// `after` is …0000111111111111 again

This is clearly not right from the point of view of GLSL and SPIR-V (or, in an even more horrifying example, from the point of view of ROV loads/stores in HLSL): there are no control flow constructs (conditionals, loops, returns) in the shader code that would suggest that during should be different from before and after; they're in the same control flow tree node, and there are no returns between them. Interlocking is exposed purely at the level of a single invocation, which is not true of the implementation on AMD.
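The lane-mask behavior in the GLSL example above can be mimicked with a small CPU-side Python model of the intrawave-collision loop, showing how the active-lane mask narrows inside the critical section while the full mask is restored afterwards (purely illustrative, using the same masks as the example):

```python
# CPU-side model of the intrawave collision loop: the driver runs the
# critical section once per overlap layer, each time with only that
# layer's lanes active, so a ballot inside the CS no longer sees the
# full subgroup. Illustrative sketch only.

def run_wave(full_mask, overlap_layers):
    before = full_mask                       # ballot outside the CS
    during = []                              # ballots inside the CS
    for layer_mask in overlap_layers:        # implicit driver loop
        during.append(layer_mask & full_mask)
    after = full_mask                        # full mask restored after the CS
    return before, during, after

# ...0000111111111111 wave with two overlap layers, as in the example
before, during, after = run_wave(0x0FFF, [0x000F, 0x0FF0])
assert before == after == 0x0FFF
assert during == [0x000F, 0x0FF0]            # narrowed per iteration
```

This is precisely the discrepancy being called out: SPIR-V sees one invocation with no control flow between the ballots, yet the effective exec mask differs between them.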

Yesterday I was thinking how they can be handled more or less safely, but it looks like for that, it would be necessary to locate all dependency chains of every ballot that cross the boundaries of the critical section, and include them in the CS. However, this has at least two issues.

One inconvenience is that by including a dependency chain of a ballot crossing the CS boundaries into the CS, you're expanding the CS, thus changing its boundaries — and some ballots that could stay outside previously now may have to be moved into it. Though this can be solved by just running this pass again and again until it makes no more changes.

But a more severe problem, which I already explained in my previous message but whose painfulness I have to highlight, is that obtaining the dependency chain of something is highly non-trivial when variables (including dynamically indexed arrays), or what's even worse, global memory, are involved. Basically, if you're writing into a buffer or an image, potentially any buffer/image load (with restrict, for that specific resource; without restrict, for any non-restrict resource of a type with which aliasing is possible) preceded by that store may be a dependency of that store. And you don't even have to actually write the result of a ballot for the dependency to appear: that would be an extremely weird and rare use case (more like just a test case); however, there's a much more realistic situation: any write to a manually non-uniform-indexed (for which ballots and first-lane reads are commonly used) image/buffer is also a memory dependency on the ballot.

One potential solution that I thought about was forcing all ballots to be inside the critical section. But there's an obvious flaw in it, so I'm of course not going to use it — that would outright ruin the last ballot in the shader that was outside the CS in the original code. Specifically, that would change:

criticalSection {
  // ROV accesses here
}
uint64_t lanesRemaining = ballotARB(true);
while (lanesRemaining) {
  // some non-uniform resource access scalarized manually here
}

into:

uint64_t lanesRemaining;
criticalSection {
  // ROV accesses here
  lanesRemaining = ballotARB(true);
}
while (lanesRemaining) {
  // some non-uniform resource access scalarized manually here
}

thus lanesRemaining would belong not to the whole subgroup, but just to the uppermost overlap layer — but later it would be used for providing data for the entire subgroup. Again, for this to work correctly, all the dependencies of lanesRemaining will have to be moved into the critical section.

I'm really not sure that I want to spend a huge amount of time trying to untangle all this mess, so at least at first I'll probably just leave a // FIXME for now. While this would of course result in behavior that makes no sense from the SPIR-V or GLSL point of view, at least the change of the set of active lanes would happen in locations that are predictable and can be taken into account. Specifically, they will be the OpBeginInvocationInterlockEXT and OpEndInvocationInterlockEXT themselves, or, if they are in different control flow nodes, and thus the boundaries of the critical section need to be moved to a parent control flow node to include both, the critical section will start right before and/or end right after a control flow construct, such as a conditional or a loop — where the writer of the shader (unless they explicitly assume that the condition is uniform) would expect the active lanes to naturally potentially change. For this reason, I'll also drop my idea of narrowing the CS to the memory accesses inside it (which would help if the shader, for instance, is written in GLSL and thus adheres to the extremely strict control flow rules of GL_ARB_fragment_shader_interlock, and puts the CS begin/end instructions on the outermost control flow level, but does all the memory accesses between them conditionally), as that would essentially result in something even more broken, unpredictable and uncontrollable — basically in the way ballots interact with ROV accesses on Direct3D.

Of course we could use more radical solutions, such as going Intel's sendc way and disabling intrawave collisions completely if ballots are used, if that's possible on the hardware (loading intrawave collisions is switchable separately from loading overlapped wave info for some reason, but I'm not sure if disabling that reliably works) — or, if not, wrapping the whole shader in the CS (but then it may be reasonable to switch off EXEC_IF_OVERLAPPED too, and again, I don't know if that can be relied on).

However, this edge case of the interaction of ballots and FSI is probably too weird for real usage. What's much more realistic is that a shader would want the critical section to be as short as possible: especially if it only does programmable blending (like some simple overlay/hard light) somewhere at the end, it wouldn't make much sense not to do some heavy, but independent, lighting work earlier in the shader in parallel between overlapped and overlapping invocations, no matter whether they are in the same wave or in different ones. So in the end, I think I'll just keep ballots broken, but broken in a controllable way, as the alternatives that I can imagine seem unreasonable.

@Triang3l
Copy link

@jpark37 While this was very long ago, if you still remember the details of your tests, could you please provide the settings you had in the Intel OIT sample?

Most importantly, was MSAA used in your test run on AMD, and what exact algorithm was used without ROV in your testing setup (if any OIT at all)? MSAA specifically has a massive performance hit with ROV on AMD due to adjacent primitives overlapping each other as I found out two comments above, but that applies only to the PixelInterlock modes. Without MSAA, or with MSAA in the SampleInterlock mode, in Nvidia's OIT sample, in a spinlock -> interlock comparison, I was getting a ratio similar to your Intel and Nvidia results (on the RX Vega 10, without MSAA, 22ms > 26ms if I recall correctly). MSAA with PixelInterlock, on the other hand, was closer to what you were getting on AMD, though even worse — a 15x-ish increase. However, the spinlock is also an approach that's very hostile to parallelism, so maybe the spinlock was just slow in the first place, and the interlock turned out to be just slightly slower.

Though I'll probably also try running it by myself when I finish other tasks. I also wonder, by the way, since it's a D3D11 sample, whether implicit UAV barriers might have caused a significant drop, or was ROV actually the bottleneck there.

@jpark37
Copy link

jpark37 commented Dec 2, 2022

While this was very long ago, if you still remember the details of your tests, could you please provide the settings you had in the Intel OIT sample?

Sorry, I don't.

Most importantly, was MSAA used in your test run on AMD

This much is unlikely though because I'll always disable MSAA if the option is in front of me.

@devshgraphicsprogramming

Why is this closed? The issue is still not resolved.

Because AMD isn't going to resolve it; it's a feature that is hardly used outside the emulation community. Some devs are looking at other, slower ways to do this because they're afraid to piss off AMD users.

3 AAA games use it, and AMD exposes this feature on DX12.

Also we plan to use it to do CSG in a single pass for a CAD app, at this rate we'll open a popup saying "Buy a real GPU" and open a browser with Amazon and Ebay searches for Nvidia and Intel when we detect an AMD GPU.

@Triang3l
Copy link

Triang3l commented Jan 10, 2023

Yes, and I did more research recently for my future blog post. The deterministic ordering, and thus the lack of temporal noise, in overflow handling makes fragment shader interlock a much more reliable solution even for order-independent transparency compared to other two-pass methods such as those with a spinlock. This was also shown in the GRID 2 article (for them, even 2 nodes per pixel were enough for order-independent transparency, and 4 nodes for Adaptive Volumetric Shadow Maps for smoke lighting) and in the MLAB benchmark. Without fragment shader interlock, no matter how advanced your tail blending algorithm is, it will always be broken anyway, because you'll get incoherent noise whenever any overflow happens: like an analog TV with no antenna connected, on your trees or in your glass panes.

Additionally, with fragment shader interlock you can do OIT partially: coarsely sorting large batches of geometry (level map tiles, objects, meshlets), and doing fine OIT inside those batches and between nearby batches, including to handle intersecting polygons (which are very common in foliage). And if you sort batches by the farthest depth in them (a conservative estimate is enough), you can compare the sort key of the current batch with the closest OIT fragment depth in the pixel so far. If the current batch turns out to be closer, you can safely resolve OIT as soon as that happens and free all your OIT layers for reuse, since you know that all new fragments will be closer from that moment on. And this happens without causing any pipeline stalls for pixels that don't need OIT, unlike an explicit resolve pass (even a stencil-masked one) with pipeline barriers, which also wouldn't work with instancing or mesh shading. This can effectively provide you infinite layers in the view, with only a small number of layers within object "clusters" needed in RAM.

Other uses that come to mind are deferred decals — blending into the normal G-buffer, as demonstrated by Nvidia (especially useful for decals on curved surfaces); or drawing huge numbers of sorted particles with a custom blending equation (like Hard Light for both lightening and darkening), as well as with per-particle blending equation selection (especially useful with bindless textures — to have all the additive fireworks and all the alpha-blended smoke in a single draw command with correct ordering between each other — and you can't just put fireworks in one draw command and smoke in another, as you wouldn't be able to mix ordering of the particles between the two).
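As an illustration of the custom-blending case, here's a minimal sketch of a particle fragment shader doing Hard Light blending inside an interlocked section. The storage image name (colorBuf), format, and input names are assumptions, not anything from a real codebase:

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;
// coherent so that writes by earlier overlapping invocations are visible
layout(binding = 0, rgba16f) uniform coherent image2D colorBuf;

layout(location = 0) in vec4 particleColor;

// Hard Light: multiply for dark source values, screen for light ones
vec3 hardLight(vec3 src, vec3 dst) {
  vec3 multiplied = 2.0 * src * dst;
  vec3 screened = 1.0 - 2.0 * (1.0 - src) * (1.0 - dst);
  return mix(multiplied, screened, step(0.5, src));
}

void main() {
  ivec2 p = ivec2(gl_FragCoord.xy);
  beginInvocationInterlockARB();
  vec4 dst = imageLoad(colorBuf, p);
  vec3 blended = hardLight(particleColor.rgb, dst.rgb);
  imageStore(colorBuf, p, vec4(mix(dst.rgb, blended, particleColor.a), dst.a));
  endInvocationInterlockARB();
}
```

Per-particle equation selection would just branch (or index a bindless table) before the store; the interlock guarantees the read-modify-write sequences of overlapping fragments don't interleave, and the ordered mode additionally keeps them in primitive order.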

On the implementation side, by the way, I'm somewhat worried about the changes to POPS setup introduced by GFX11. Specifically, the POPS_OVERLAP_NUM_SAMPLES setting is now gone. Currently it's difficult for me to allocate the money to purchase a testing device, so I can't check this by myself. But can someone (@Anteru possibly?) please confirm, how does POPS behave on RDNA 3 with sample-rate shading?

Direct3D and Metal (and Intel's old extension) only require sample-granularity interlocking with sample-frequency shading. However, Vulkan and OpenGL fragment shader interlock gives explicit control of the interlock scope to the shader via its execution mode — so it's still possible to request pixel-level interlock, which offers wider guarantees, in a sample-rate shader (like via POPS_OVERLAP_NUM_SAMPLES = 0 on Vega/RDNA/RDNA2). And if the device supports the fragmentShaderPixelInterlock feature, it must expose whole-pixel-scope interlock regardless of the shading frequency of the shader. However, without the fragmentShaderPixelInterlock feature, it's not possible to support ROVs in DXVK and VKD3D at all, as PixelInterlock execution modes wouldn't be usable even conditionally, while PixelInterlockOrdered would be required for multisampling without sample-rate shading — even though in native Direct3D ROVs would just work naturally.
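For reference, in GLSL (via GL_ARB_fragment_shader_interlock, whose execution modes SPIR-V mirrors), the interlock scope is requested with exactly one of these declarations in the fragment shader, with the critical section then delimited by beginInvocationInterlockARB()/endInvocationInterlockARB():

```glsl
layout(pixel_interlock_ordered) in;     // needs fragmentShaderPixelInterlock in Vulkan
layout(pixel_interlock_unordered) in;   // needs fragmentShaderPixelInterlock
layout(sample_interlock_ordered) in;    // needs fragmentShaderSampleInterlock
layout(sample_interlock_unordered) in;  // needs fragmentShaderSampleInterlock
```

The pixel modes are the wider guarantee discussed above: they serialize any two fragments covering the same pixel even when their sample coverage doesn't overlap.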

@TheLastRar
Copy link

You briefly mentioned custom blending equations, and many have also mentioned wanting this extension for programmable blending, it's possible to use feedback loops for that purpose https://registry.khronos.org/vulkan/specs/1.2-extensions/html/vkspec.html#renderpass-feedbackloop

Mesa zink uses this to implement fbfetch (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12603), which Mesa uses to support the GL_KHR_blend_equation_advanced extension.

PCSX2 uses it to support non-standard blend modes. This requires splitting render passes and using barriers (with VK_DEPENDENCY_BY_REGION) to ensure sync. Performance is reasonable on Nvidia (even with a very large number of draws), however, AMD doesn't fully support VK_DEPENDENCY_BY_REGION and has worse performance as a result, see this Reddit post & AMD Community post, which seems to be a hardware limitation.

I don't know if OIT can be supported by the above approach, and I don't know how performance compares to using shader interlocks w.r.t. custom blending.
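For illustration, a minimal sketch of the shader side of the feedback-loop approach (single-sampled, with the color attachment also bound as an input attachment of the same subpass; all names here are made up):

```glsl
#version 450
// The current color attachment is also bound as input attachment 0,
// forming a render pass feedback loop.
layout(input_attachment_index = 0, set = 0, binding = 0) uniform subpassInput lastColor;

layout(location = 0) in vec4 srcColor;
layout(location = 0) out vec4 outColor;

void main() {
  vec4 dst = subpassLoad(lastColor);
  // Any non-standard blend equation goes here, e.g. subtractive:
  outColor = vec4(max(dst.rgb - srcColor.rgb * srcColor.a, vec3(0.0)), dst.a);
}
```

As noted, only one read-modify-write cycle per pixel is safe before the render pass has to be split with a (by-region) self-dependency barrier, which is where the AMD performance problem comes in.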

@Triang3l
Copy link

Triang3l commented Jan 11, 2023

@TheLastRar Fragment shader interlock and shader framebuffer fetch can both be used to implement programmable blending, but they're quite different in the details, so it would be ideal if hardware supported both. (By SFBF I mean Arm's ordered version, not an explicit barrier, which causes excessive synchronization, especially on AMD, which doesn't have BY_REGION as far as I know; even on mobile tiled GPUs, what's basically expected from BY_REGION at worst is simply not flushing/reloading tile memory, making the barrier just tile-local rather than global. A barrier also doesn't support draw commands with overlap, let alone intersecting primitives.) However, with Intel currently being the only PC graphics card developer supporting the latter (while all of the biggest 3 have fragment shader interlock in their hardware and in at least some of their drivers), FSI is effectively the only option on the PC now.

But in general, fragment shader interlock offers massively more flexibility than shader framebuffer fetch.

The only advantage of SFBF I can imagine is that it supports late depth/stencil test, so it works directly with things like alphatested surfaces. With fragment shader interlock, the write happens in the shader, so your only choices are early depth/stencil (with post-depth coverage with MSAA), which only works for opaque surfaces not modifying the depth or the stencil reference from the shader, or full-blown software depth testing.

However, FSI, being a shader part rather than an output-merger one, allows for arbitrary addressing, and that removes lots of limitations:

  • The layout and the amount of pixel-local data can be managed completely freely. SFBF restricts you to (R32G32B32A32 * maximum output attachment count) of data at most, and if you use more than one attachment, your data becomes scattered across multiple image subresources. With FSI, you can store arbitrary amounts of data with any layout you want.
  • Random access within pixel-local data. Relative addressing of temporary registers is a painful thing, on AMD for instance the offset is placed in a scalar register, not a vector one. But more importantly, you don't always need to access all the per-pixel data. With SFBF, if you want to modify data at some dynamic location, you need to fetch all the data, and then to export it back. An example of when this may happen is order-independent transparency — if there's no overflow, you don't need to tail-blend and thus to locate the least important fragment to replace throughout all the pixel, you only need to increment the counter and to add the new fragment, so there's no need to waste the bandwidth reading and writing the entire list. Another example would be single-pass construction of trees, like for the Z axis when doing 2D-tiled light clustering with conservative rasterization — again, you only need to visit the nodes that matter for the new light.
  • Not only the inner layout, but the global placement of the pixel-local data is controlled by the application. The data doesn't have to be placed in a rectangle in one or multiple image subresources. This is important for emulation, including Xenia in particular, where we emulate EDRAM addressing of the console, thus data reinterpretation (which happens a lot on the console, even for clears that are done via 4x MSAA depth-only rectangle draws there) between different framebuffer offsets, widths, MSAA sample counts, bit depths, can be done with no copying — just a pipeline barrier is sufficient.
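As a rough sketch of the second point, assuming a hypothetical pixel-local layout of a per-pixel counter image plus a layered fragment image, and assuming this runs between the begin/end interlock instructions:

```glsl
const int MAX_LAYERS = 8;
layout(binding = 0, r32ui) uniform coherent uimage2D fragCount;
layout(binding = 1, rg32ui) uniform coherent uimage2DArray fragData;

// Call only inside the interlocked section.
void appendFragment(ivec2 p, uint packedColor, float depth) {
  uint count = imageLoad(fragCount, p).x;
  if (count < uint(MAX_LAYERS)) {
    // No overflow: bump the counter and write the new fragment,
    // without reading or rewriting any of the existing layers.
    imageStore(fragData, ivec3(p, int(count)),
               uvec4(packedColor, floatBitsToUint(depth), 0u, 0u));
    imageStore(fragCount, p, uvec4(count + 1u, 0u, 0u, 0u));
  } else {
    // Overflow: only now pay for scanning all the layers to find the
    // least important fragment to merge out (tail blending).
  }
}
```

With SFBF, by contrast, all MAX_LAYERS layers would have to pass through the shader on every fragment regardless of whether the overflow path is taken.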

Also I'm not entirely sure about the requirements of Arm's rasterizer-ordered attachment access (maybe subpassInputMS is allowed, but I don't know for sure), but compared to OpenGL ES SFBF, FSI has one advantage for MSAA — you still can use pixel-frequency shading with FSI, and access per-sample data based on the input coverage mask with sample interlock, or per-pixel data as usual with pixel interlock. The OpenGL ES SFBF specification says: "Reading the value of gl_LastFragData produces a different result for each sample. This implies that all or part of the shader be run once for each sample…", but in reality as far as I know it's always full fallback to sample-rate shading, which cancels out the idea of MSAA on the performance side.

Note that with FSI, you can still take advantage of texture tiling (if I understand correctly, it's even the same 64KB_R_X for both framebuffer attachments and storage images on RDNA and RDNA 2 normally), and as far as I know, modern AMD GPUs support internal compression for storage images as well.

But again, what's the most important is that there's no SFBF anywhere on the PC except for Intel GPUs (and maybe Innosilicon, Moore Threads, though I don't know the details about them).

@devshgraphicsprogramming

…numbers of sorted particles with a custom blending equation (like Hard Light for both lightening and darkening), as well as with per-particle blending equation selection (especially useful with bindless textures — to have all the additive fireworks and all the alpha-blended smoke in a single draw command with correct ordering between each other — and you can't just put fireworks in one draw command and smoke in another…

Another fun use for ROV is rendering to and blending non-renderable formats like RGB9E5 or some custom stuff.
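A rough sketch of what that could look like, assuming the RGB9E5 data is aliased as an r32ui storage image and the load/blend/store happens inside the interlocked section. The pack here follows the standard shared-exponent layout (9-bit mantissas, 5-bit exponent with bias 15); the function names are made up:

```glsl
vec3 unpackRGB9E5(uint v) {
  float scale = exp2(float(int(v >> 27) - 15 - 9));
  return vec3(uvec3(v, v >> 9, v >> 18) & 0x1FFu) * scale;
}

uint packRGB9E5(vec3 c) {
  const float kMax = 65408.0; // (511/512) * 2^15, the largest representable value
  c = clamp(c, 0.0, kMax);
  float maxC = max(c.r, max(c.g, c.b));
  if (maxC == 0.0) return 0u;
  // Shared exponent from the largest component (clamped for denormals)
  int e = max(-16, int(floor(log2(maxC)))) + 1 + 15;
  float denom = exp2(float(e - 15 - 9));
  // Rounding the mantissa may overflow 9 bits; bump the exponent if so
  if (floor(maxC / denom + 0.5) >= 512.0) { ++e; denom *= 2.0; }
  uvec3 m = uvec3(floor(c / denom + 0.5));
  return m.r | (m.g << 9) | (m.b << 18) | (uint(e) << 27);
}
```

Inside the critical section: imageLoad the raw uint, unpack, blend in floats, repack, imageStore; the interlock is what makes the read-modify-write of overlapping fragments safe and ordered.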

@devshgraphicsprogramming

You briefly mentioned custom blending equations, and many have also mentioned wanting this extension for programmable blending, it's possible to use feedback loops for that purpose https://registry.khronos.org/vulkan/specs/1.2-extensions/html/vkspec.html#renderpass-feedbackloop

Mesa zink uses this to implement fbfetch (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12603), which Mesa uses to support the GL_KHR_blend_equation_advanced extension.

PCSX2 uses it to support non-standard blend modes. This requires splitting render passes and using barriers (with VK_DEPENDENCY_BY_REGION) to ensure sync. Performance is reasonable on Nvidia (even with a very large number of draws), however, AMD doesn't fully support VK_DEPENDENCY_BY_REGION and has worse performance as a result, see this Reddit post & AMD Community post, which seems to be a hardware limitation.

I don't know if OIT can be supported by the above approach, and I don't know how performance compares to using shader interlocks w.r.t. custom blending.

IIRC, unless you have the brand new Vulkan ARM/EXT extension meant to replace OpenGL ES Pixel Local Storage or framebuffer fetch… subpass feedback loops are limited to a single pixel overwrite cycle, then you need a barrier (which can be by-region).

PLS + PSI are best of both worlds, because you can use the local framebuffer/tiler memory to store your MLAB4 buckets and not a coherent storageImage.

@Triang3l
Copy link

Triang3l commented Apr 1, 2023

I guess this is not an April Fools' Day joke? 😜 Even though this situation is complete tragicomedy and farce 🤷‍♂️
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22250

@John-Gee
Copy link

John-Gee commented Apr 1, 2023

I guess this is not an April Fools' Day joke? 😜 Even though this situation is complete tragicomedy and farce 🤷‍♂️ https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22250

Fantastic news, thank you for your work on this!

@SopaDeMacaco-UmaDelicia
Copy link

I guess this is not an April Fools' Day joke? 😜 Even though this situation is complete tragicomedy and farce 🤷‍♂️ https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22250

What a chad 💪😎

@rijnhard
Copy link

It just got merged into Mesa RADV devel

@r2rX
Copy link

r2rX commented Jun 26, 2023

Congratulations, @Triang3l, and well done. AMD users are indebted....only appreciation and admiration for this awesome contribution. :)

@oddMLan
Copy link

oddMLan commented Jun 26, 2023

Brilliant! The FOSS community has done it again 🎉
I guess this goes to show support on Windows is possible, right?
...........
What do you mean there aren't any open source drivers on Windows?
...........
What do you mean AMD has to implement it? But they just said they won't! ☹️

https://www.youtube.com/watch?v=Lo4DMz6fZG0

@Moonlacer
Copy link

Hello AMD Vulkan developers! Has your stance on supporting this extension changed within the last 4 years? There seems to be plenty of examples given here on how this would affect the user experience on any AMD card using the Windows proprietary drivers, so I would really like to know your current (updated) thoughts on this matter.

Best regards, Moonlacer

@Squall-Leonhart
Copy link

VK_EXT_fragment_shader_interlock has been added to amdvlk in 194a181da7e2cca5f70ec0f9e65119955b3d2b47

@RinMaru
Copy link

RinMaru commented Dec 30, 2023

VK_EXT_fragment_shader_interlock has been added to amdvlk in 194a181da7e2cca5f70ec0f9e65119955b3d2b47

That's not Windows though, is it?

@John-Gee
Copy link

John-Gee commented Dec 30, 2023 via email

@mirh
Copy link

mirh commented Dec 30, 2023

The same commit also added VK_KHR_maintenance5 (which did in fact land in 23.12.1)
Shader interlock is still nowhere to be seen though (it isn't exposed yet with availableExtensions.AddExtension)

@Triang3l
Copy link

VK_EXT_fragment_shader_interlock has been added to amdvlk in 194a181da7e2cca5f70ec0f9e65119955b3d2b47

That commit doesn't implement it, only makes it "known" to the extension management in the driver, and the newly added device feature query reports that all interlock features are unsupported. But it's an awesome sign that it's soon™! Unfortunately no new AMDVLK will be released for the Vega generation though.

@Squall-Leonhart
Copy link

VK_EXT_fragment_shader_interlock is added as of Adrenaline 24.12.1
