
Expose primitive ordered pixel shaders #108

Closed
Degerz opened this issue Aug 16, 2019 · 121 comments

@Degerz commented Aug 16, 2019

According to the Vega ISA documentation, this feature uses the SOPP scalar microcode format. Currently it is only exposed in AMD's D3D12 drivers as "rasterizer ordered views", so I'd like to see the equivalent supported in Vulkan as well, namely VK_EXT_fragment_shader_interlock.

We need this feature to emulate a certain PowerVR GPU for our use case, and in particular we want the fragmentShaderPixelInterlock feature from the extension, so can your team enable this in the Vulkan drivers? (Bonus points if the team can also get fragmentShaderSampleInterlock exposed.) If you are working on this extension, can we also get an estimate of when a driver supporting it will be released?

@Degerz (Author) commented Aug 23, 2019

Can I get a response from the team?

@jinjianrong (Member) commented Aug 23, 2019

Our stance is that we don’t want to implement it. It messes with subpasses and is not the right formulation.

@Degerz (Author) commented Aug 23, 2019

What exactly do you mean by "not the right formulation"? Is this extension somehow the wrong abstraction for this feature in the hardware?

If so, is there a better way to expose it, such as a potential framebuffer fetch extension? (I don't think AMD hardware supports framebuffer fetch with multiple render targets, so things could get sketchy.)

How does the interlock extension differ from your "primitive ordered pixel shaders"? We badly need this extension or something similar to framebuffer fetch.

@Degerz (Author) commented Aug 23, 2019

How would you like to proceed with this?

Can we at least get a vendor extension from AMD exposing this directly if your team doesn't like how the interlock extension is specified, or would you prefer to close this issue if you have no intention of exposing similar functionality in Vulkan?

I am requesting this because there is arguably a stronger case for ROV-like functionality in Vulkan than in D3D12: open-source projects have shown more interest in using it than AAA game engine developers have.

@oscarbg commented Aug 24, 2019

@Degerz @jinjianrong It's sad to see that AMD has no plans to support this extension in its Vulkan driver, even now that VK_EXT_fragment_shader_interlock is a de facto standard by virtue of being supported by all other vendors (NV and Intel) and on all other OSes (Windows and Linux).
On Windows it is supported on recent NV (>= Maxwell) and Intel (>= Gen9 Skylake) GPUs:
https://vulkan.gpuinfo.org/listdevices.php?platform=windows&extension=VK_EXT_fragment_shader_interlock
Similarly, on Linux it is supported on NV and Intel:
https://vulkan.gpuinfo.org/listdevices.php?platform=linux&extension=VK_EXT_fragment_shader_interlock

Heck, even Metal 2.0 on macOS has supported the exact same feature; on the Vega cards in the iMac Pro we get "RasterOrderGroupsSupported".

Adding more use cases: it should be useful for the VKD3D project to support D3D12 "rasterizer ordered views", in case any D3D12 games use them.

In fact, the Xenia emulator's D3D12 backend uses the ROV feature for better emulation of the Xbox 360's EDRAM hardware. Xenia is also working towards Linux support:
xenia-project/xenia#1430
and it has a less mature Vulkan backend that the Linux port would use. Xenia's Vulkan backend could take advantage of VK_EXT_fragment_shader_interlock for better/faster emulation of the Xbox hardware, so I'm pinging @Triang3l in case he wants to discuss further.

EDIT:
It could even be supported by the MoltenVK Vulkan driver on macOS, so I asked for it there:
KhronosGroup/MoltenVK#630

@Degerz (Author) commented Aug 24, 2019

@oscarbg Good idea to get more people interested in this functionality; I think I'll do the same! While you're at it, could you also ask other AMD engineers, such as @Anteru on Twitter, to show that the community wants this functionality on AMD hardware in Vulkan as well.

cc @tadanokojin @hrydgard @pent0

The above have actively expressed interest in, and/or are already using, functionality similar to shader interlock in their projects. One of their main motivations for using Vulkan is getting access to modern GPU features like interlock, so we'd prefer not to have to move to platform-specific APIs like Metal or D3D12 to use this feature!

Vulkan subpasses are possibly not powerful enough for their purposes. I don't care if AMD never exposes VK_EXT_fragment_shader_interlock itself, but please at least provide another viable alternative for their sake, even if it is an AMD-specific extension!

@pent0 commented Aug 24, 2019

I just want to do programmable blending. If you can provide other primitives that would also be OK, but this is the best one. Texture barrier (for OpenGL) is what I am using, but it's not really the fastest path (likewise for Vulkan, if applicable). I don't really know how you would do it, though.

@ryao commented Aug 24, 2019

How would you like to proceed with this?

Here is my suggestion. Use the extension and tell users to switch to either Intel or Nvidia graphics hardware because AMD refuses to support the extension and cite this issue. Watch AMD backpedal on this very quickly after an executive hears about the situation.

Also, ask the RADV developers to implement support so that Windows users who want to use software that depends on it have the option of switching to Linux for it.

@Triang3l commented Aug 24, 2019

From the render pass point of view, how is this different from regular image/buffer stores?

@Degerz (Author) commented Aug 24, 2019

@ryao It seems unlikely that this would reach the very highest echelons of the company, and I don't know if the Mesa developers are all that interested, since I haven't seen any patches related to this issue ...

@Triang3l Here are some insights from Sascha. Along with the stated limitations, I do not think Vulkan subpasses are capable of handling self-intersecting draws, just as OpenGL's texture barrier isn't.

@RussianNeuroMancer commented Aug 24, 2019

It seems unlikely that this would reach the very highest echelons of the company

News article on Phoronix could help with this a bit.

@ryao commented Aug 24, 2019

@ryao It seems unlikely that this would reach the very highest echelons of the company, and I don't know if the Mesa developers are all that interested, since I haven't seen any patches related to this issue ...

All that you need is for end users to start telling each other that AMD graphics hardware is not friendly to emulators after they start asking why it doesn’t work on AMD graphics hardware. It will reach the upper echelons when they are trying to figure out why they did not meet their sales projections.

As for the Mesa developers, they might not know that this extension has any use cases. I was under the impression that those working on RADV are volunteers, so if you don't ask them about it, they seem less likely to implement it.

@Degerz (Author) commented Aug 24, 2019

@ryao TBH, I feel it is more constructive for developers like @pent0 to simply express their desire for this feature and list their use cases instead ...

At the end of the day, advanced system emulation doesn't account for even a fraction of AMD's customers, and the emulation community is already aware that AMD has a checkered history with them, so the leading hardware vendor is already favoured there.

I'd prefer to show that their driver manager's position is out of touch with the community's position, because with higher-ups such as executives there's no guarantee that they'd understand this issue or that they'd be GPU driver specialists who could help us out.

@pent0 commented Aug 24, 2019

I love you guys. I know you can do it, however hard it is. Go go go! We all want this feature.

Also, programmable blending is not only what emulators want; it's also what many game developers desire in order to achieve nice effects in their games that fixed blending cannot do. I can't give a PC example, but here is one done with Metal on iOS.

The programmable shader pipeline has been here for 15 years, so programmable blending should be too. It's the de facto standard nowadays. (I copied this quote from this article.)

I am really bad at wording, hehe; I'm just expressing what many people want. Please reconsider :)

@ReaperOfSouls1909 commented Aug 25, 2019

AMD would be an amazing company if only you stopped making horrible choices and drivers. What a shame.

@jarrard commented Aug 25, 2019

Who exactly does this affect, anyway? Just PowerVR GPU users? If so, I can understand why AMD doesn't want to dedicate valuable development time to this endeavour. Nothing is stopping the community from adding it themselves, thanks to open-source drivers!

@jarrard commented Aug 25, 2019

Who exactly does this affect anyway? just PowerVR GPU users?

In relation to the topic and example given! How is this not obvious?

PowerVR GPU

Then that's not the best example; it would have been better to give examples that are not fringe cases but more common uses.
Also, referring to people as retarded is why stuff like this gets ignored; it's quite an anti-open-source attitude to have, and it only derails things.

Take a chill pill mate!
THE END

@pent0 commented Aug 25, 2019

It's not only relevant to emulation; it's relevant to everything. DXVK needs interlock to implement ROVs, and game developers need it to do programmable blending.

It affects many things, hence this extension exists. Please think more.

@jfdhuiz commented Aug 25, 2019

@pent0 I get that you are angry. If you feel misunderstood, express that feeling. Make your points and cut out the strong language. Strong language won't help your cause (for innocent bystanders it looks like you're grasping at straws), and it is disrespectful. Your point is much, much stronger without the strong language.

@jinjianrong (Member) commented Aug 25, 2019

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

Additionally, this is an inefficient method of performing the typical thing it's often advocated for - order independent transparency. For such an effect we would usually recommend using one of the many two-pass OIT algorithms out there, and making use of multiple subpasses, with the second pass doing the resolve. This is likely the most portably efficient mechanism you can use that works between desktop and mobile parts. We're thus not inclined to support it, as we'd rather not promote an inefficient technology.

However, if you're looking to do direct emulation, we are not sure that really helps you - perhaps you could elaborate on what it is you're trying to emulate exactly and we may be able to advise on an alternative method?
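For readers unfamiliar with the approach AMD alludes to, here is a minimal CPU-side sketch of one such two-pass algorithm (weighted blended OIT, in the spirit of McGuire and Bavoil). This is an illustration of the general idea only, not AMD's specific recommendation; the weight function and the accumulation-target formats noted in the comments are assumptions:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// One translucent fragment covering a pixel.
struct Fragment { float r, g, b, a, depth; };

// Depth-based weight so nearer fragments dominate the weighted average.
static float weight(float a, float z) {
    return a * std::fmax(0.01f, 1.0f / (1e-5f + z * z));
}

// Pass 1 accumulates in any order (order-independent); pass 2 is a single
// resolve, which in Vulkan could be a second subpass reading the two
// accumulation attachments.
std::array<float, 3> resolveOIT(const std::vector<Fragment>& frags,
                                std::array<float, 3> bg) {
    float accum[4] = {0, 0, 0, 0}; // RGBA16F-style accumulation target
    float revealage = 1.0f;        // R8-style product of (1 - alpha)
    for (const Fragment& f : frags) {          // pass 1, any submission order
        float w = weight(f.a, f.depth);
        accum[0] += f.r * f.a * w;
        accum[1] += f.g * f.a * w;
        accum[2] += f.b * f.a * w;
        accum[3] += f.a * w;
        revealage *= 1.0f - f.a;
    }
    std::array<float, 3> out;                  // pass 2: composite over bg
    float denom = std::fmax(accum[3], 1e-5f);
    for (int c = 0; c < 3; ++c)
        out[c] = (accum[c] / denom) * (1.0f - revealage) + bg[c] * revealage;
    return out;
}
```

Because pass 1 is purely commutative accumulation, it needs no interlock at all, which is the property AMD's suggestion relies on; the tradeoff is that the result is an approximation rather than exact sorted blending.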

@MojoJojoDojo commented Aug 25, 2019

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications

It's being used in Just Cause 3 and GRID 2.
https://software.intel.com/en-us/articles/optimizations-enhance-just-cause-3-on-systems-with-intel-iris-graphics

https://software.intel.com/en-us/articles/oit-approximation-with-pixel-synchronization

@illusion0001 commented Aug 25, 2019

Quote from @oscarbg:
The Xenia (Xbox 360 emulator) D3D12 backend uses the ROV feature for better emulation of the Xbox 360's EDRAM hardware.

@pent0 commented Aug 25, 2019

I don't know much about this stuff, so I will let the people who do discuss it. I will try to find a workaround for now.

Hi, in our case we are trying to emulate a feature of the PowerVR GPU: you can fetch the last fragment data of a texel in the color buffer and use it for blending inside the fragment shader. It's like blending, but not fixed-function; it happens inside the shader (programmable).

For OpenGL on AMD we are using texture barrier. On Vulkan I'm not sure if that's available (the only thing I know of so far is the pipeline barrier, but I will look more). What would you advise me to do in this case?

Edit: @Degerz was asking for our case, thanks! I was not aware you had asked this before you pinged me.
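What @pent0 describes maps to something like the following in software terms: the fragment stage reads the destination texel already in the color buffer and runs an arbitrary blend function, instead of going through fixed-function blend state. This is a conceptual model only, not driver code; the function and type names are invented for illustration:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

using Pixel = std::array<float, 4>;

// PowerVR-style framebuffer fetch, modeled in software: the "fragment shader"
// reads the destination texel that is already in the color buffer and runs an
// arbitrary blend function instead of fixed-function blending.
void drawFragment(std::vector<Pixel>& colorBuffer, int index, Pixel src,
                  const std::function<Pixel(Pixel src, Pixel dst)>& blendFn) {
    Pixel dst = colorBuffer[index];          // the "framebuffer fetch"
    colorBuffer[index] = blendFn(src, dst);  // programmable, not fixed-function
}
```

The point of the model is that blendFn can be any function of src and dst, including conditionals on the destination value, which no fixed-function blend state can express; the hardware question in this thread is how to make that read-modify-write safe when fragments overlap.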

@Degerz (Author) commented Aug 25, 2019

@jinjianrong Thank you very much for the response!

Support for interlocks/ROVs isn't that compelling in D3D because engine developers are more interested in targeting broad hardware compatibility than in using the latest features, as I mentioned. By comparison, there are already open-source desktop(!) OpenGL applications out there using interlocks or framebuffer fetch, and we would like to be able to target both Windows and Linux on AMD's Vulkan drivers.

Also, we don't want to implement order-independent transparency with Vulkan subpasses. We want the capability to do programmable blending for emulation purposes, and for this reason alone Vulkan subpasses are not a powerful enough mechanism, since they possibly(?) can't handle self-intersecting draws the way we see with texture barrier. I understand from the hardware side that primitive ordered pixel shaders place an ordering constraint on fragment shader execution, and that certainly has undesirable effects in terms of increased latency due to stalling.

This feature helps us emulate systems that have non-standard fixed-function blending pipelines, as well as systems capable of programmable blending via shader framebuffer fetch. The biggest reason interlocks/ROVs/fetch have an advantage over Vulkan subpasses is that the latter do not cover the edge case of self-intersecting geometry, and thus, in our case, give incorrectly rendered content!

Edit: If I had to rate the severity of lacking this feature, it would be almost as bad as not having transform feedback/stream-output available for DXVK. Your team added support for that fundamentally hardware-unfriendly feature as well, so just as with transform feedback, we need interlocks to cover some more cases, even if they have undesirable performance characteristics.

@hrydgard commented Aug 25, 2019

As another data point, as the author of PPSSPP, the popular Sony PSP emulator: the PSP also has a few blend modes that cannot be replicated without fragment shader interlock or similar programmable blending functionality. Now, games don't actually use them much, and don't generally use them for self-overlapping draws, so framebuffer copies work in practice to emulate them; but for fully hardware-accurate emulation this would be useful.

I'm not directly involved with Xenia and have only followed its development from the side, but it needs this functionality to simulate some framebuffer formats that only exist on the Xbox 360 and are heavily used by games. They're not practically feasible to emulate in other ways.

@Triang3l commented Aug 25, 2019

@jinjianrong Xenia needs this for pretty much everything in render target emulation:

  • Blending with piecewise linear gamma, with float7e3.7e3.7e3.unorm2 color format, with exponent-biased 16-bit snorm format (with -32 to 32 range).
  • Float20e4 depth (especially important when games do EDRAM–RAM–EDRAM round trips (GTA IV, Halo 4) and in this case 32-bit floats cannot be used as games reupload the depth buffer to the EDRAM by writing 24-bit depth to GBA of a R8G8B8A8 color render target aliasing the depth/stencil buffer totally destroying invariance). Of course it's not as fast without all the hi-Z, compression and true early Z, but it's more or less acceptable, and with copying to allow for aliasing we would still lose the first two.
  • Fast aliasing without copying (which is also inaccurate as there are some draws without viewport/scissor where we can't even determine the height of the render target — like drawing a DI_PT_RECTLIST to a single render target at RB_COLOR_INFO::EDRAM_BASE 0 (so nothing that would naturally truncate it) with a custom vertex shader and vfetch layout) — aliasing happens a lot even in cases like clears, which are usually done via a 4x MSAA depth buffer even for single-sampled color buffers.
  • MSAA via ForcedSampleCount (without that on GPUs without SV_StencilRef we can't restore the stencil buffer after aliasing and have to fall back to slow SSAA).
@gharland commented Aug 25, 2019

The unordered variant of this extension is essential for voxelization, global illumination, and volumetric rendering. Otherwise, what option is there for avg- or max-blending voxels other than clunky atomicCompSwap spinlocks? Couldn't we at least have the unordered variant? Even if there were no use case, why can't developers simply have another tool in the toolbox for coming up with new algorithms?

The extension has also been requested here; please come over and register your interest.

https://community.amd.com/message/2927066

https://community.amd.com/message/2926956
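As an aside, for a monotonic operation like max, the "clunky spinlock" isn't strictly necessary: a plain compare-and-swap retry loop suffices and never blocks. A minimal CPU model of the GLSL atomicCompSwap pattern, with an invented function name:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// GLSL-style atomicMax on a 32-bit voxel value, emulated with a CAS retry
// loop. Unlike a spinlock, this never blocks: each attempt either succeeds
// or observes a value that is already >= ours, in which case we're done.
void atomicMaxU32(std::atomic<uint32_t>& voxel, uint32_t value) {
    uint32_t old = voxel.load(std::memory_order_relaxed);
    while (old < value &&
           !voxel.compare_exchange_weak(old, value, std::memory_order_relaxed)) {
        // 'old' was refreshed by the failed CAS; retry until we either
        // store our value or see one that is already at least as large.
    }
}
```

The true spinlock only becomes unavoidable for non-monotonic, multi-word updates such as a running average of wide voxel payloads, which is exactly the case where an unordered interlock would help.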

@Degerz (Author) commented Aug 25, 2019

@jinjianrong I'm sure you've realized it by now, but our stance on this issue as a community is non-negotiable, so we do not wish to pursue the 'alternatives' you speak of.

I understand the anxiety your team faces in exposing a hardware-unfriendly feature, and if you absolutely cannot have the general public accessing it, then I have a solution: apply a whitelist so that these community projects specifically can access the feature in the driver.

Is whitelisting certain applications a viable solution at your end for our case?

@Triang3l commented Aug 25, 2019

@Degerz Whitelisting would make adding this to new projects impossible; it would never have existed in Xenia if we had had to go through any procedure of being added to a whitelist (and ROV usage there began as an experiment anyway), and that's the opposite of how PC gaming works.

@phire commented Aug 28, 2019

@jpark37 Correct, that's what I'm suggesting in my edit.

@Degerz My understanding is that shader cores (Nvidia, AMD, and Intel) are entirely capable of doing indirect branches and have been for years; it's just not really exposed to pixel shaders. OpenGL 4.0 does have the ARB_shader_subroutine extension, but that only exposes uniform control flow that is static per draw call.

The downside is that the whole wave/warp follows the branch. To do dynamic per-thread indirect branching, you would have to disable lanes and loop, executing up to 64 different subroutines before continuing. Performance would be very dependent on how often threads branch to the same blend shader.

But I think GCN and later are technically capable of this. I wonder if Nvidia added hardware to Tesla to accelerate this operation?

Also, I think you would be surprised at the performance of literally writing a bytecode interpreter in your pixel shader and compiling blend programs to bytecode that you store in uniforms. Keep the bytecode simple: only the operations you need and only four or eight registers, spilling any extras to the stack. Your resolve shader would just use a switch statement and a bunch of dynamic indirect array accesses to execute it.

It won't be as fast as the option above, but it is potentially competitive, within an order of magnitude.
More importantly, it doesn't require the development of any special extensions and works across many GPUs (though watch out for dynamic array indexing performance on Nvidia; they don't have an "index into array of registers" operation like most other vendors do).
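To make the interpreter idea concrete, here is a toy CPU model of the scheme: a four-opcode blend "ISA" with a four-register file, the kind of program you would compile guest blend state into and store in uniforms. The opcode set and the result-in-r0 convention are invented for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// A toy blend "ISA": load source/destination color, multiply, add.
enum Op : uint8_t { LOAD_SRC, LOAD_DST, MUL, ADD };

struct Inst { Op op; uint8_t dst, a, b; }; // register indices 0..3

using Vec4 = std::array<float, 4>;

// Interpret one blend program for one pixel; the result is left in r0 by
// convention, mirroring what the resolve shader's switch loop would do.
Vec4 runBlend(const std::vector<Inst>& prog, Vec4 src, Vec4 dstColor) {
    std::array<Vec4, 4> r{}; // register file
    for (const Inst& i : prog) {
        switch (i.op) {
        case LOAD_SRC: r[i.dst] = src; break;
        case LOAD_DST: r[i.dst] = dstColor; break;
        case MUL:
            for (int c = 0; c < 4; ++c) r[i.dst][c] = r[i.a][c] * r[i.b][c];
            break;
        case ADD:
            for (int c = 0; c < 4; ++c) r[i.dst][c] = r[i.a][c] + r[i.b][c];
            break;
        }
    }
    return r[0];
}
```

For example, additive blending compiles to three instructions: load src into r0, load dst into r1, then ADD r0 = r0 + r1. A real shader-side version would read the instruction words from a uniform buffer instead of a std::vector.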

@jpark37 commented Aug 28, 2019

Another idea, how about dividing up the index buffer into indirect draws?

Example:
10 triangles, 30 indices. Triangles 4, 6, and 7 overlap.

Do a triangle list draw that does just enough work to store enough info to know which primitives overlapped: vkCmdDrawIndexed(commandBuffer, indexCount=30, instanceCount=1, firstIndex=0, vertexOffset=0, firstInstance=0);

Do some sort of pass that takes that information and builds a list of indirect draws:
VkBuffer argumentBuffer, sized to maxDrawCount: DrawIndexed [0:17], DrawIndexed [18:20], DrawIndexed [21:29]
VkBuffer countBuffer: 3

Set up regular draw state, and do multi-draw indirect:
vkCmdDrawIndexedIndirectCountKHR(commandBuffer, argumentBuffer, 0, countBuffer, 0, maxDrawCount, stride);

Initial questions:

  • What's the best way to build the overlap info structure?
  • How best to convert that info into an array of indirect draw arguments?
  • What should maxDrawCount be sized to? What to do if it's short?
  • Can this work for other types of draws?
  • What's the support level of vkCmdDrawIndexedIndirectCountKHR?
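A CPU-side sketch of the first two questions, detecting overlap and turning it into draw ranges. This greedy pass uses screen-space bounding boxes as a conservative overlap test; a real implementation would do this on the GPU, and the data structures here are invented for illustration:

```cpp
#include <cassert>
#include <vector>

// Screen-space AABB of one triangle — a conservative stand-in for the real
// per-pixel overlap test the first pass would perform.
struct Box { float x0, y0, x1, y1; };

static bool overlaps(const Box& a, const Box& b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// First/last triangle of each indirect draw, inclusive.
struct DrawRange { int first, last; };

// Greedy split of the triangle list: start a new draw whenever a triangle
// overlaps one already in the current draw, so every draw is internally
// overlap-free and a barrier can be placed between draws.
std::vector<DrawRange> splitByOverlap(const std::vector<Box>& tris) {
    std::vector<DrawRange> draws;
    int start = 0;
    for (int i = 0; i < (int)tris.size(); ++i) {
        for (int j = start; j < i; ++j) {
            if (overlaps(tris[i], tris[j])) {
                draws.push_back({start, i - 1});
                start = i;
                break;
            }
        }
    }
    if (start < (int)tris.size())
        draws.push_back({start, (int)tris.size() - 1});
    return draws;
}
```

On the example above (triangle 6 overlapping triangle 4, and triangle 7 overlapping triangle 6), this produces the three draws [0:5], [6:6], [7:9] — in index terms, [0:17], [18:20], [21:29]. Each DrawRange would then be converted into one VkDrawIndexedIndirectCommand in the argument buffer.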
@oscarbg commented Aug 28, 2019

Just two additional cents:
@Degerz If ROV use cases on Vulkan end up using VK_KHR_shader_atomic_int64, then we will have to ask the Intel Windows Vulkan driver team to implement it (or use the interlock EXT in that case); it's available everywhere else on Linux and Windows.
EDIT: Intel VK_KHR_shader_atomic_int64 support on Windows looks feasible, as it is now supported by Anvil.

In light of shared tests where AMD's use of ROV causes a 20x slowdown, I wonder whether the big slowdowns on AMD are really due to the architecture not being as friendly to ROV as NV's and Intel's, or whether AMD additionally has to implement software workarounds for hardware bugs in the ROV path that cause more slowdown than would otherwise be needed. I may be wrong, but one hint may be that Xenia's D3D12 ROV path had notable graphical issues running Red Dead Redemption on Vega the last two times I tested (with a good six months between tests). Maybe @Triang3l can share thoughts on AMD rendering bugs in Xenia's D3D12 ROV path on Vega, and whether anyone knows if the new Navi 5700 cards render correctly in Xenia's D3D12 ROV case.

@oscarbg commented Aug 28, 2019

Hey guys, good news to share: we will have the interlock extension supported in MoltenVK on macOS very soon:

KhronosGroup/SPIRV-Cross#1138

So even AMD GPUs on macOS (Vega only right now) will support it.
It would be interesting if DXVK or VKD3D gained ROV support, so we could run the Intel demos mentioned earlier and see whether the big slowdowns on AMD GPUs are still there on Metal.

@Degerz (Author) commented Aug 29, 2019

Also, do we tell them to just sort the linked lists by primitive ID and store the shadow map framebuffer combiner state too as a solution?

Attempting a bytecode interpreter for our case to generate blend shader programs could be something to consider in the future, once system configurations with GPUs capable of 10+ TFLOPS and 500+ GB/s become common enough, which won't be too far off ...

@MadByteDE commented Sep 6, 2019

Its not only relevant to the emulation, its related to everything. DXVK needs interlock to implement ROV, games developer need it to do programmable blending.

Hey guys, I'm just a regular consumer and would like to know whether what you're discussing here may be the cause of this kind of artifacting seen on Navi / Raven Ridge in games using DXVK. It seems the devs aren't going to answer it any time soon, so... this could keep new people from creating more issues about it over and over again.

@Joshua-Ashton commented Sep 7, 2019

@MadByteDE No.

@mcoffin commented Sep 7, 2019

@MadByteDE Nope (I think); this is just about how to expose some new NGG capabilities.

@ReaperOfSouls1909 commented Sep 9, 2019

The emulator PCSX2 also plans on using this, so many projects want it and it would be good to have.

@amayra commented Jan 30, 2020

I guess my next GPU is Nvidia, then?

@jarrard commented Jan 30, 2020

I'm sure somebody will step up and produce some code towards making it happen; AMD's drivers are open source, after all. Nvidia's, on the other hand: no chance.

@Joshua-Ashton commented Jan 30, 2020

@jarrard I really don't think you're going to find anyone outside of AMD who would want to spend time adding it to AMDVLK.

@jarrard commented Jan 31, 2020

Maybe not AMDVLK, but possibly the RADV driver.

@oddMLan commented Feb 8, 2020

@jinjianrong

Our stance is that we don’t want to implement it. It messes with subpasses and is not the right formulation

WTF? GL_INTEL_fragment_shader_ordering worked perfectly fine in the 17.x Radeon drivers. It's actually a requirement for certain emulation software. You're telling me you won't give us any alternative for fragment interlocking in OpenGL? It seems you only care about proprietary APIs (DirectX). Your OpenGL drivers suck big time; literally everyone agrees on that point.

https://github.com/PCSX2/pcsx2/wiki/OpenGL-and-AMD-GPUs---All-you-need-to-know
https://dolphin-emu.org/blog/2013/09/26/dolphin-emulator-and-opengl-drivers-hall-fameshame/

Maybe if I told everyone I know not to touch AMD GPUs with a 10 ft pole, and your sales dwindled, you'd start wanting to implement it. Oh yeah, but big thanks for the 3% speedup in certain PC games with each 1.2 GB driver update, while your OpenGL support remains terribly pitiful.

@amayra commented Apr 13, 2020

Friendship ended with ayymd
Now NVIDIA is my best friend

@SaltyBet commented Aug 4, 2020

As a heads-up, official AMD ROV support might be coming (I'm guessing post-Navi):

AMD ROV Patent US20200202815A1

@Joshua-Ashton commented Aug 5, 2020

@SaltyBet That was filed in 2018.

@RinMaru commented Aug 5, 2020

Yeah, it's not coming. A lot of emulator devs are already looking for alternatives for fear of pissing off AMD users.

@gharland commented Aug 5, 2020

I only skim-read it, so correct me if I'm wrong, but it looks like a software solution that would be just as slow as rolling your own.

A fast non-order-dependent critical section would still be nice.

@RinMaru commented Aug 5, 2020

I only skim-read it, so correct me if I'm wrong, but it looks like a software solution that would be just as slow as rolling your own.

A fast non-order-dependent critical section would still be nice.

That would be per-pixel linked-list OIT; IIRC it's been done in Redream and other DC emulators recently, basically to work around the issue.

@JacobHeAMD closed this Aug 19, 2020
@Joshua-Ashton commented Aug 19, 2020

Why is this closed? The issue is still not resolved.

@RinMaru commented Aug 19, 2020

Why is this closed? The issue is still not resolved.

Because AMD isn't going to resolve it; it's a feature that is hardly used outside the emulation community. Some devs are looking at other, slower ways to do this because they're afraid of pissing off AMD users.

@Joshua-Ashton commented Aug 19, 2020

You could work around it for programmable blending if the resource is in GENERAL layout (i.e. no DCC): emit a readback barrier and sample the current framebuffer's image as a normal image, then blend that way.

@Triang3l commented Aug 19, 2020

So I assume the only option we have on Vulkan on AMD is a "mutex buffer" with an R32_UINT per pixel + atomic CAS spinlock, though I'm not sure if that would preserve the order of polygons, especially translucent ones with programmable blending (for opaque, apart from manual depth testing, a "primitive index" buffer could possibly be used, rejecting if new draw–instance–primitive index < last written index, but with blending there's no 1:1 association between a pixel/sample and a primitive/draw anymore, also not sure how wrapping could be handled).

The issue with the readback barrier partial workaround is that it needs to be placed in the command buffer, and thus can't work for self-overlapping draws.

Per-pixel linked lists and sorting by draw/instance/primitive during a resolve pass could work, but it would have a huge memory overhead, and it would impose a limitation on the number of overlaps, unlike what ROV or fixed-function blending provides.
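A CPU model of that per-pixel linked list fallback: append fragments in arbitrary order during the "draw", then sort each pixel's list by primitive index at resolve time. The node layout and the alpha blending are simplified for illustration; on the GPU, the head pointers and node pool would live in storage buffers updated with atomics:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// One recorded fragment: color, alpha, guest primitive index, next-node link.
struct Node { float col[3]; float a; uint32_t primIndex; int next; };

struct PPLL {
    std::vector<int> head;   // one head pointer per pixel, -1 = empty list
    std::vector<Node> pool;  // shared node pool (this is the memory overhead)
    explicit PPLL(int pixels) : head(pixels, -1) {}

    // Models the GPU append: atomic counter bump + head-pointer exchange.
    void append(int pixel, Node n) {
        n.next = head[pixel];
        head[pixel] = (int)pool.size();
        pool.push_back(n);
    }

    // Resolve pass for one pixel: gather, sort by primitive index, blend.
    std::array<float, 3> resolve(int pixel, std::array<float, 3> bg) const {
        std::vector<Node> frags;
        for (int i = head[pixel]; i != -1; i = pool[i].next)
            frags.push_back(pool[i]);
        std::sort(frags.begin(), frags.end(),
                  [](const Node& x, const Node& y) { return x.primIndex < y.primIndex; });
        for (const Node& f : frags)
            for (int c = 0; c < 3; ++c)
                bg[c] = f.a * f.col[c] + (1.0f - f.a) * bg[c];
        return bg;
    }
};
```

The sketch makes the two drawbacks mentioned above visible: the node pool must be sized for worst-case overdraw, and overflow must be handled somehow, neither of which ROV or fixed-function blending has to worry about.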

@RinMaru commented Aug 19, 2020

Per-pixel linked lists are what Redream does for the DC's OIT, and yeah, the memory overhead grows with internal resolution.

@Triang3l commented Aug 19, 2020

@JacobHeAMD @jinjianrong Anyway, arbitrarily dropping features that are "not recommended because of being inefficient" (poor architectural compatibility with subpasses is not a dead-end issue, considering Intel and Nvidia implement this feature in Vulkan fine) sets a bad precedent and contributes to stagnation of the API (and graphics as a whole), and to even more divergence between APIs and issues for those who want to support them all.

Any tool may be used well, just like any tool may be used badly, but it still has its uses. When we choose the tool to do the job, we know the goals and requirements of the task, the limitations that we can agree upon, and we evaluate the advantages of potential solutions and the drawbacks of each.

Let's take the debatable topic of antialiasing as an example. MSAA offers a sharp, stable (no "defocusing" every time a small movement happens) image, preserving small details, and also provides some transparency effects through sample masking, so it's very nice for vivid, sandbox-feeling, immersive games, and pretty much the only option for VR. However, it's noticeably more expensive, for this reason relatively rarely used in this generation (because of consoles, and because it's a bit complicated to integrate into a rendering pipeline with post-processing effects that use depth), and does not help with shading aliasing. Would those be good reasons to drop MSAA completely in the API even though your GPU supports it fine? Leaving developers with two choices — either blurry options eating small details (TAA, less blurry, but jaggy in motion — FXAA, MLAA, SMAA), or falling back to true supersampling, which would take us to the goal, but result in something even worse than what you wanted by removing MSAA, because that would have much higher performance costs. But if we have some milliseconds to spare on our target hardware, or can simplify some art, and not that many high-frequency low-roughness objects so specular aliasing is not a big issue (or can selectively supersample some parts of shaders or the frame, that cause the most aliasing, but not everything) — why not? MSAA would solve our goal perfectly considering our requirements and limitations. (I really hope you never ever treat this paragraph as a suggestion… at least you're still implementing EQAA even and are not advocating for reconstruction techniques like DLSS, so I guess here we're pretty safe.)

ROV is also a tool with good and bad sides, but those are factors to consider when using it, not a reason to make the concept unusable outright. Yes, it has the flaw of interlocking only within individual pixels/samples, so users can't benefit from, or recreate, optimizations used in the fixed-function output-merger such as color compression and sample deduplication. But we are aware of that; everything has limitations: rasterization has them, ray tracing has them. It's still a powerful and valid option for various tasks.

If you're doing programmable blending or order-independent transparency, you don't have to use ROV for every single translucent effect in your frame: you can sort on a coarse level and only use ROV for fine sorting within the specific objects that need it, perhaps through an additional framebuffer with premultiplied alpha (possibly with even lower bandwidth and memory usage than per-pixel linked lists, and without the potential overflow linked lists suffer under heavy overdraw).
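
For reference, programmable blending through the interlock is a short critical section in the fragment shader. A minimal GLSL sketch of a custom "over" blend into a premultiplied-alpha accumulation target (the binding, image format, and blend formula here are illustrative placeholders, not any engine's actual code):

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Overlapping fragments of the same pixel enter the critical section
// below one at a time, in primitive order.
layout(pixel_interlock_ordered) in;

// Placeholder: an accumulation target holding premultiplied-alpha color.
layout(binding = 0, rgba16f) uniform coherent image2D accum;

layout(location = 0) in vec4 src; // premultiplied source color

void main() {
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    vec4 dst = imageLoad(accum, p);
    // Any blend formula can go here, not just the fixed-function ones.
    imageStore(accum, p, src + dst * (1.0 - src.a));
    endInvocationInterlockARB();
}
```

The same shape works for sample-rate interlock by switching the layout qualifier to `sample_interlock_ordered`.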

If, like in the emulation case, you're using it for pixel packing: first of all, it's the only solution (apart from TBR-like subpass input, which, however, turns multisampling into supersampling) that allows for maximum accuracy (and unlike, say, art fidelity, accuracy is not some subjective "looks good to me" matter that allows tradeoffs; either the original visuals are reproduced correctly, or it's just broken, less so in some games, totally in others). You may also have plenty of milliseconds to spare if you're emulating a 2000s console on a 2020 system, so performance may simply not be an issue for you (remember that the cycle-accurate bsnes and higan exist too). In Xenia, the naïve purely ROV-based output path (even for depth/stencil) is actually significantly faster than the traditional render-target-based one, since the latter involves a lot of copying to support reinterpreting EDRAM data. Even with ROV, though, there's still room for optimization. A conservative host depth buffer could be added so that true early Z can work (unlike discarding in the shader based on whether the manual depth test passed anywhere in the quad, checked via ddx/ddy, which still requires the wave to be launched). Another possible optimization is using the fixed-function output-merger where suitable and falling back to ROV only for formats that don't exist on the PC, or for parts of the frame requiring unusual blending; fallbacks like that are not uncommon in GPU design itself, for instance for cases where cmask/htile metadata becomes unusable.
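
To give a rough idea of what an ROV-style output path with a manual depth test can look like, here is a minimal GLSL sketch against a guest depth buffer kept in a storage image. The binding, format, and less-equal comparison are illustrative assumptions, not Xenia's actual implementation (which packs EDRAM data far more elaborately):

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;

// Illustrative binding: emulated depth buffer stored as a plain r32f image.
layout(binding = 0, r32f) uniform coherent image2D guest_depth;

layout(location = 0) out vec4 color;

void main() {
    ivec2 p = ivec2(gl_FragCoord.xy);
    float new_depth = gl_FragCoord.z; // guest depth format conversion omitted

    beginInvocationInterlockARB();
    // The interlock makes this read-modify-write atomic per pixel and
    // primitive-ordered, which the fixed-function depth test can't do
    // for a shader-visible buffer.
    bool pass = new_depth <= imageLoad(guest_depth, p).x;
    if (pass) {
        imageStore(guest_depth, p, vec4(new_depth));
    }
    endInvocationInterlockARB();

    if (!pass) {
        discard; // no fixed-function depth test in this path
    }
    color = vec4(1.0); // placeholder color output
}
```

Since the depth test result is only known inside the shader, early Z can't reject these fragments, which is exactly why the conservative host depth buffer mentioned above would help.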

"Inefficient" is not an absolute measure. A solution may be relatively less efficient in some cases and relatively more efficient in others, sometimes far more efficient, and it depends on what kind of efficiency you need in each individual case: efficient as in a short frame time, as in completely solving the problem, or even as in development time (Doom Eternal with its 500 FPS is more the exception than the rule). And you don't need perfect efficiency in all cases, you just need the solution to be efficient enough for your requirements. ROV is efficient enough for us (including on AMD hardware, where it's still faster than RTV/DSV in our case), and could work just fine. But now, instead of having it work just fine in our planned Vulkan version, we'll likely have to put up a warning asking users on AMD GPUs to switch to the Direct3D 12 renderer, or to RADV if ROV support appears there. A situation with no winners. At least it's not something like the S3TC patent situation that crippled functionality for no good reason, but still, nothing positive comes out of simply removing useful tools.

Addition: Could you provide some clarification on "messes with subpasses"? Interlocks exist purely within the fragment shader stage of the pipeline; from the point of view of Vulkan resource dependencies, there is hardly any difference from regular image/buffer loads/stores with shader atomics, which Vulkan handles fine (the ROV case is actually even stricter, since you're not supposed to scatter when using a ROV). If the mapping between pixel/sample indices and image/buffer addresses changes, you need to insert what would be the Vulkan equivalent of a D3D UAV barrier into the command buffer; that's fine and makes complete sense, because you're losing interlock-based synchronization for accesses through those addresses and therefore need synchronization at another (pipeline) level. The only interaction between interlocks and the pipeline I can think of is that the extension provides no way to explicitly break interlocking between subpasses if you ever need that. But who really cares about such a tiny optimization that would merely reduce false positives (there are no false negatives, so interlocking still functions fully), and which could be added in another extension anyway?

@Passingby1
Copy link

@Passingby1 Passingby1 commented Nov 20, 2020

@jinjianrong @JacobHeAMD
It's ok, really; our stance is that the green camp is a better all-around choice for PC users who want to use their machines in whatever way they see fit.

Awesome job, never change.
