Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement VulkanRenderer #17

Closed
27 of 29 tasks
flibitijibibo opened this issue Apr 10, 2020 · 19 comments
Closed
27 of 29 tasks

Implement VulkanRenderer #17

flibitijibibo opened this issue Apr 10, 2020 · 19 comments
Assignees
Labels

Comments

@flibitijibibo
Copy link
Member

flibitijibibo commented Apr 10, 2020

The SPIR-V emitter for MojoShader is already done, so we just need to write the driver. It's possible that the emitter may need some modifications for uniform/attribute linking, but the bulk of the work was already done for ARB_gl_spirv support.

Some useful links:

Alpha:

  • Finish checklist in cosmonaut's tracker
  • Clean up any remaining validation errors in otherwise working tests
  • Clean up and squash/merge mojoshader_vulkan
  • Clean up and squash/merge initial VulkanRenderer

Beta:

Release Candidate:

  • Multithreaded Vulkan commands
  • Device check in PrepareWindowAttributes
  • Render pass clear optimization
  • Resolve all FIXMEs
  • Verify performance
    • Compare memory usage with OpenGL
    • Compare with NVIDIA OpenGL
    • Compare with Mesa radeonsi
    • Compare with ANGLE Vulkan
    • Compare with Metal on iOS

Vulkan By Default:

@flibitijibibo flibitijibibo changed the title Implement VulkanDevice Implement VulkanRenderer Apr 10, 2020
@flibitijibibo
Copy link
Member Author

Currently being done in #13

@thatcosmonaut
Copy link
Member

Remaining tasks are being tracked here:
thatcosmonaut#9

@flibitijibibo
Copy link
Member Author

flibitijibibo commented Jun 25, 2020

TODO list, since we're coming up to the first testable version...

SNIP Moved to the OP!

@flibitijibibo
Copy link
Member Author

Some notes for point rendering... in glspirv we have this patching setup when linking the shader program:

https://github.com/FNA-XNA/MojoShader/blob/upstream/mojoshader_opengl.c#L477

It's not too complicated but it does mean we have to hold off on building the shader module until we know that point sprite rendering is in use. This may end up being a part of the shader linker that solves the problem of linking vertex/pixel shader semantics together, instead of this disaster I came up with:

https://github.com/FNA-XNA/MojoShader/blob/upstream/profiles/mojoshader_profile_spirv.c#L1513

@flibitijibibo
Copy link
Member Author

flibitijibibo commented Jul 3, 2020

Some notes for threading support... we know two major limitations to Vulkan threading:

  • Command buffers/pools require a mutex when accessed from multiple threads
  • Certain commands cannot be executed during a render pass

To me, this says "we need a mutex that locks on both a render pass and on vkCmd execution." I'm thinking we have a mutex called "asyncCommandLock" that not only wraps around vkCmd calls, but also blocks for the entire duration of a render pass and for the entire duration that the command buffer is not recording. This sounds really broad, but this isn't as bad as it might seem:

  • For single-threaded games it means nothing, as the mutex will never be fought for
  • For threaded loading it means it will only ever be rendering or loading, never both at once
  • For threaded drawing, where loading is done on the main thread, point 2 is still true

This makes it about as deferred as D3D11, where we have a big lock on the DeviceContext.

Where it gets a little weird is the function dependency tree... these are all the functions that record a command, and the functions that use them:

INTERNAL_BufferMemoryBarrier
	INTERNAL_GetTextureData
	SetTextureData*
INTERNAL_ImageMemoryBarrier
	INTERNAL_GetTextureData
	INTERNAL_CreateFauxBackbuffer
	INTERNAL_BeginRenderPass
	INTERNAL_OutsideRenderPassClear
	SwapBuffers
	VerifySampler
	ResolveTarget
	SetTextureData*
	GenColorRenderbuffer
	GenDepthStencilRenderbuffer
	CreateDevice
INTERNAL_GetTextureData
	GetTextureData2D
		ReadBackbuffer
	GetTextureDataCube
INTERNAL_SetViewportCommand
	INTERNAL_BeginRenderPass
	SetViewport
INTERNAL_SetScissorRectCommand
	INTERNAL_BeginRenderPass
	SetScissorRect
	ApplyRasterizerState
INTERNAL_SetStencilReferenceValueCommand
	INTERNAL_BeginRenderPass
	SetReferenceStencil
INTERNAL_SetDepthBiasCommand
	ApplyRasterizerState
INTERNAL_BindPipeline
	Draw*
INTERNAL_BeginRenderPass
	INTERNAL_UpdateRenderPass
		Clear
		Draw*
INTERNAL_EndPass
	INTERNAL_UpdateRenderPass
		Clear
		Draw*
INTERNAL_RenderPassClear
	Clear
INTERNAL_OutsideRenderPassClear
	Clear
INTERNAL_BindResources
	Draw*
SwapBuffers
Draw*
SetBlendFactor
	SetBlendState
ApplyVertexBufferBindings
ResolveTarget
SetTextureData*
QueryBegin
QueryEnd
SetStringMarker

We have to make sure to lock for the duration that a single function submits commands, in addition to locking when a pass is active and a buffer isn't recording. This will prevent commands from getting mixed up when a thread tries to poke a command in during another function's execution (if that matters, it may not?).

EDIT: vkCmd limitations:

Clear/Draw, INSIDE ONLY:
vkCmdBindDescriptorSets
vkCmdBindIndexBuffer
vkCmdBindPipeline
vkCmdBindVertexBuffers
vkCmdClearAttachments
vkCmdClearColorImage
vkCmdClearDepthStencilImage
vkCmdDraw
vkCmdDrawIndexed
vkCmdSetBlendConstants
vkCmdSetDepthBias

ResolveTarget, SwapBuffers:
vkCmdBlitImage

SetTextureData, OUTSIDE ONLY:
vkCmdCopyBufferToImage

GetTextureData, OUTSIDE ONLY:
vkCmdCopyImageToBuffer

Wherever, really:
vkCmdPipelineBarrier
vkCmdSetScissor
vkCmdSetViewport
vkCmdSetStencilReference
vkCmdInsertDebugUtilsLabelEXT

BeginQuery, restarts pass:
vkCmdResetQueryPool
vkCmdBeginQuery

EndQuery, wherever:
vkCmdEndQuery

LOCKS THE PASS:
vkCmdBeginRenderPass

UNLOCKS THE PASS:
vkCmdEndRenderPass

@flibitijibibo
Copy link
Member Author

flibitijibibo commented Jul 8, 2020

Some more threading notes:

Previously we tried decoupling render and data command buffers, but we didn't quite get it working because there's a bit more to it than just having separate buffers to avoid render pass errors. It may be worth trying this again, but with a better idea of how to synchronize the buffers:

Render Command Buffer

  • Locks for the following commands:
    • vkCmdBindDescriptorSets
    • vkCmdBindPipeline
    • vkCmdClearAttachments
    • vkCmdClearColorImage
    • vkCmdClearDepthStencilImage
    • vkCmdDraw
    • vkCmdDrawIndexed
    • vkCmdSetBlendConstants
    • vkCmdSetDepthBias
    • vkCmdBlitImage
    • vkCmdSetScissor
    • vkCmdSetViewport
    • vkCmdSetStencilReference
    • vkCmdInsertDebugUtilsLabelEXT
    • vkCmdResetQueryPool
    • vkCmdBeginQuery
    • vkCmdEndQuery
    • vkCmdBeginRenderPass
    • vkCmdEndRenderPass
  • Locks and submits the data command buffer for the following commands:
    • vkCmdBindIndexBuffer
    • vkCmdBindVertexBuffers
    • VULKAN_INTERNAL_ImageMemoryBarrier

Data Command Buffer

  • Locks for the following commands:
    • vkCmdCopyBufferToImage
    • VULKAN_INTERNAL_AllocateSubBuffer
    • VULKAN_INTERNAL_BufferMemoryBarrier
  • Locks and submits the render command buffer for the following commands:
    • vkCmdCopyImageToBuffer
    • VULKAN_INTERNAL_Stall

When to Lock Command Buffers

It's not enough to lock just for vkCmd operations - we need to lock for whole blocks of commands, as several commands may need to exist together for the queue to work (for example, render passes need to have both a Begin and End, and SetData needs one lock/unlock pair for both the buffer memory barriers and the vkCmdCopyImageToBuffer in between). Because of this, we may need to refactor the internal functions in such a way that it's clear what is a data command and what is a render command, while also making it clear when we're locking the command buffer to submit commands and when we're locking to submit to the queue. This is particularly bad for image memory barriers and pipeline stalls, where we typically have render and data operations happening all at once.

@flibitijibibo
Copy link
Member Author

flibitijibibo commented Jul 9, 2020

Yup, even more threading notes:

Vulkan commands called by every driver function:

  • VULKAN_SwapBuffers
    • vkCmdPipelineBarrier
    • vkCmdEndRenderPass
    • vkCmdBlitImage
  • VULKAN_Draw*
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
    • vkCmdBeginRenderPass
    • vkCmdSetViewport
    • vkCmdSetScissor
    • vkCmdSetStencilReferenceValue
    • vkCmdSetBlendConstants
    • vkCmdSetDepthBias
    • vkCmdBindPipeline
    • vkCmdBindDescriptorSets
    • vkCmdBindIndexBuffer
    • vkCmdDrawIndexed
    • vkCmdDraw
  • VULKAN_SetViewport
    • vkCmdSetViewport
  • VULKAN_SetScissorRect
    • vkCmdSetScissor
  • VULKAN_SetBlendFactor
    • vkCmdSetBlendConstants
  • VULKAN_SetReferenceStencil
    • vkCmdSetStencilReference
  • VULKAN_SetBlendState
    • vkCmdSetBlendConstants
  • VULKAN_SetDepthStencilState
    • vkCmdSetStencilReference
  • VULKAN_ApplyRasterizerState
    • vkCmdSetScissor
    • vkCmdSetDepthBias
  • VULKAN_VerifySampler
    • vkCmdPipelineBarrier
    • vkCmdEndRenderPass
  • VULKAN_VerifyVertexSampler
    • vkCmdPipelineBarrier
    • vkCmdEndRenderPass
  • VULKAN_ApplyVertexBufferBindings
    • vkCmdBindVertexBuffers
  • VULKAN_ResolveTarget
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
    • vkCmdBlitImage
  • VULKAN_ResetBackbuffer
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
  • VULKAN_ReadBackbuffer
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
    • vkCmdCopyImageToBuffer
  • VULKAN_SetTextureData
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
    • vkCmdCopyBufferToImage
  • VULKAN_GetTextureData
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
    • vkCmdCopyImageToBuffer
  • VULKAN_GenColorRenderbuffer
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
  • VULKAN_GenDepthStencilRenderbuffer
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
  • VULKAN_SetVertex/IndexBufferData
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
  • VULKAN_GetVertex/IndexBufferData
    • vkCmdEndRenderPass
    • vkCmdPipelineBarrier
  • VULKAN_QueryBegin
    • vkCmdEndRenderPass
    • vkCmdResetQueryPool
    • vkCmdBeginQuery
  • VULKAN_QueryEnd
    • vkCmdEndQuery
  • VULKAN_SetStringMarker
    • vkCmdInsertDebugUtilsLabelEXT

@flibitijibibo
Copy link
Member Author

As of the latest commit, the VulkanRenderer is now in Beta. All the critical features are done and function, but there are still some FIXMEs and a handful of places where performance improvements are possible, but at this point the renderer should Just Work with XNA games now. Instead of having a testing period and locking down a release date, we'll try and get it to where it's indisputably faster than OpenGL, then lock down a Release Candidate and final testing period before making it the default renderer.

@flibitijibibo
Copy link
Member Author

At this point we now have Vulkan running pretty fast with all the uninitialized memory issues handled by Valgrind - there's really only one thing left for us to assess: Moving Vulkan command recording and queue submissions to a thread. This is relatively straightforward on paper, as it should just be a matter of moving VULKAN_INTERNAL_RecordCommands, Flush, and FlushAndPresent on to a dedicated thread which runs on a single semaphore. Adding threads is always harder than that, but as far as existing structure is concerned we've already set this up really really well. Once this works, it is expected that most games will be at least as fast when running with Vulkan, vs running with OpenGL.

SpriteBatchTest shows the issue pretty clearly when you crank up the sprite count, as it's definitely faster to submit commands, but when it's time to present the frame it chokes pretty hard and it doesn't show up in the frame timing at all since it's all squished into Present() right now. Games like FEZ show very stable framerates but still fall behind GL by a pretty sizeable percentage. Games are easiest to compare by using GALLIUM_HUD=fps and VK_INSTANCE_LAYERS=VK_LAYER_MESA_overlay.

@flibitijibibo
Copy link
Member Author

The latest VulkanRenderer now supports proper command buffer generation and submission and asynchronous presentation. As a result, performance is pretty damn good and beats OpenGL in a whole lot of cases now.

The downside is that threading support had to get rolled back, but we expect this to be easier to fix than last time. Once we have our stress tests working, that'll be when we start running through the catalog to make sure everything works, and provided nothing goes wrong (ha) we'll mark this as a release candidate and eventually move it to the top of the driver list. Most of the TODOs/FIXMEs are minor, but they may be important for optimization later on.

One last thing we need that's not critical for testing but is critical for release is PrepareWindowAttributes - we have it falling back when we don't have a loader, but we also need to fall back when no hardware supports the extensions we need. I'm hoping we can mostly rip off device creation here without creating a device, but we'll see...

@TheSpydog
Copy link
Member

Mostly just leaving this as a note for my future self. When we get to testing games, there are a few cases that will need special attention, beyond the standard "does it look ok and not crash" testing. In particular...

  1. Little Racers Street: We'll need to do a soak test in the garage. That scene of the game leaks memory like crazy, so it's a good stress test for both MojoShader uniform buffers and FNA3D-side buffer allocation.
  2. FEZ: We need to check the ending cutscene to make sure lines are being appropriately rendered, since we've run into some issues with that in the past.
  3. FEZ: On MoltenVK there's a weird blip immediately after the initial loading screen, where the screen goes black except for a small sprite in the middle. I'm not sure if this is an MVK bug or we're accidentally tripping some undefined behavior that's just happened to work correctly on the other devices we've tested.

I'm sure there are other special cases, these are just the ones at the forefront of my memory.

@flibitijibibo
Copy link
Member Author

Thankfully 2 won't happen, as it was a spec violation regarding GL_POINT_SPRITE, a state that's not variable in Vulkan and can't have the same issues like radeonsi/iris did.

3 I remember apitracing but I forget what the end result was... a missing clear maybe? It's probably in the Discord logs somewhere.

@flibitijibibo
Copy link
Member Author

flibitijibibo commented Oct 22, 2020

The latest commit should have Vulkan threading in a working state. But, as always, you should run this against any threaded work you have just in case, as behavior tends to be highly unpredictable for XNA games.

In my own testing this works well (Streets of Rage 4 is running fine with it), though I'm now realizing that we're having to look at memory management again as it uses well over twice the memory that GL does, which is fine if you have a lot of VRAM but not so fine if you don't (SOR4 bails at loading a single level on a 2GB card). This could possibly be influenced by the threads (maybe some contention with the memory allocator?) but we'll need to investigate that either way.

@flibitijibibo
Copy link
Member Author

We just pushed the commits to FNA and FNA3D that resolve all the known issues we have - the remaining FIXMEs/TODOs in the source aren't critical for release.

At this point we can finally start doing mass testing of games. I happen to be releasing Streets of Rage 4 really really soon, so I'll be monitoring that...

@flibitijibibo
Copy link
Member Author

Testing is moving along slowly but surely - one major detail though: Has anyone checked MoltenVK on iOS? Even if we just test Rogue Legacy that's fine, I just want to be sure we're not losing anything by fully deprecating Metal when the Vulkan work is done.

@thatcosmonaut
Copy link
Member

thatcosmonaut commented Dec 17, 2020

Oh I meant to ask, on Mono 4 games like Sequence and Blueberry Garden I get the following error when I drop in a new FNA.dll and fnalibs:

❯ FNA3D_FORCE_DRIVER=Vulkan ./Sequence.bin.x86_64 
Stacktrace:

  at SDL2.SDL.SDL_SetHintWithPriority (string,string,SDL2.SDL/SDL_HintPriority) <0x0008b>
  at Microsoft.Xna.Framework.SDL2_FNAPlatform.ProgramInit (Microsoft.Xna.Framework.LaunchParameters) <0x00593>
  at Microsoft.Xna.Framework.FNAPlatform..cctor () <0x00c23>
  at (wrapper runtime-invoke) object.runtime_invoke_void (object,intptr,intptr,intptr) <0xffffffff>
  at Microsoft.Xna.Framework.Game..ctor () <0x0020f>
  at Sequence.Game1..ctor () <0x015e7>
  at Sequence.Program.Main (string[]) <0x0004b>
  at (wrapper runtime-invoke) <Module>.runtime_invoke_void_object (object,intptr,intptr,intptr) <0xffffffff>

Any idea how I can work around this?

@flibitijibibo
Copy link
Member Author

I forget what it is but I think it has to do with the compiler vs runtime? I fixed it locally either by using an older compiler or updating the runtime, and Blueberry needs MONO_FORCE_COMPAT=1.

@flibitijibibo
Copy link
Member Author

We just got the last major issue fixed in the Vulkan renderer - I'll be doing a full run of SOR4 on a 2GB card and if all is well we'll be in release candidate for 21.02. If I don't even run into memory issues I'll also take out the VULKAN_MEMORY_REQUIREMENT variable.

@flibitijibibo
Copy link
Member Author

We're at the point where this is actively shipping in stuff and works well enough, so at this point I'm going to mark this as "stable, but not default". For making this default, see #72.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants