Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vulkan: Buffer Mirrors for MacOS performance #4899

Merged
merged 18 commits into from Aug 14, 2023

Conversation

riperiperi
Copy link
Member

@riperiperi riperiperi commented May 12, 2023

This PR brings MacOS ryu up to the level of performance seen with the macos1 build, but obviously still missing geometry shaders, etc.

Problem

Guest GPU methods such as the constant buffer updater and inline2memory are used to load arbitrary data into buffer memory at any time. This can be done between draws in a render pass with no additional barriers. When a game uses a constant buffer update that changes a buffer used in a shader, the data must be uploaded each time it's used. Occasionally, either the driver or games queue many updates that are used by the next draw in sequence. Even we update a buffer between draws sometimes - the support buffer that contains things like texture scale.

Vulkan doesn't have an inline data update method like this - you need to be outside a render pass and either perform a buffer copy, update buffer command or compute write. Interrupting a render pass isn't too bad on desktop - you just need to pay for barriers to avoid changing the data while a draw is in progress, and a little bit for mostly useless Vulkan calls. However, on mobile GPUs (which are typically tiled) there is huge cost to ending a render pass.

On tiled GPUs, render passes load texture data from shared memory into tile memory, then store that back into shared memory when the render pass ends. This avoids straining shared memory bandwidth, at the cost of a full copy in and out at the start and end of each render pass. Games written with tiled GPUs in mind can sometimes ignore the data in shared memory (clear/ignore) and/or have it live entirely in tile memory to avoid storing back, but we don't get that luxury as all our textures must exist in shared memory and we have no idea how they will be used in the future.

The result is that consecutive inline updates end and begin the render pass repeatedly, which results in expensive loads and stores potentially each draw. The bigger and more numerous the render targets, the more issue this causes as the data must be copied back and forth every time. This obviously depends on the game, as some games don't read cbu results from the shader, but when it happens you will definitely feel it.

Solution

To avoid both barriers and render pass splits, we can instead "mirror" entire bindings by copying their current state into a temporary buffer and binding that instead. This trades inline updates on the GPU for data copies on the CPU and binding swaps. The buffer data we make copies of ends up being way smaller than the framebuffer copies, so a lot of memory bandwidth and time is saved in the majority of cases.

On Apple Silicon GPUs this can greatly increase performance depending on the game's behaviour.

Limitations

  • Buffers that have been written from the GPU side (storage, buffer copy etc) cannot be mirrored is that write is still in-flight. The copies are made from a mapping of the buffer in the host address space, so we need to make sure no pending changes are present that might affect the buffer before making a copy.
    • When this happens after a mirror has been created, all bindings of this mirror must be replaced with the original buffer, as that data may be intended for an existing binding.
    • If another mirrorable write happens after a mirror has been created and bound, the binding is recreated with a new mirror.
  • Buffers that are not mapped in the host address space cannot be mirrored. This mostly affects desktop GPUs as buffers are always mapped on systems with shared memory. Mirrors are also mostly disabled on desktop GPUs anyways, so this isn't really an issue.
  • Mirrors are considered valid until the next command buffer. They may be freed after the command buffer ends, as their fence on the staging buffer will be met when it completes. As a result, mirrors are recreated whenever the pipeline command buffer is submitted.

Misc changes

  • GetRange has been replaced with GetRangeAligned. This will get a range of a buffer that is aligned to a page size rather than getting the whole buffer range. This is used when mirroring read-only storage buffers, as otherwise they can use bindings much larger than requested.
    • This was originally done to handle some games (Yokai Watch #, can't remember which) reading outside of their binding range. This limits this to only be valid within the page the end address is contained within, which should be a good assumption but should probably be tested anyways.
  • Bindings sent to the backend now include a write flag. This may actually avoid some existing Vulkan only issues where buffer usage rules might not be respected, but is much more important for mirrors to function correctly.
  • BitMapStruct has been added as a bitmap that operates on a struct. This improves memory locality as the entire bitmap can be embedded within another class or struct, the only downside is that the bit array must be of a predefined size.
  • There are changes to the DescriptorSetUpdater to accelerate batch operations on uniform/storage set booleans. This may change by the time this PR is undrafted.
  • The staging buffer has gained a method that allows reserving a range of memory for any purpose. This is currently used for placing mirrors on the staging buffer. They are considered valid until the next command buffer, obviously.
  • The staging buffer size has been increased as it can now hold mirrors.

@riperiperi riperiperi added performance Performance issue or improvement os:macOS An issue or feature request exclusively relating to macOS labels May 12, 2023
@iRonJ
Copy link

iRonJ commented May 12, 2023

On an 13" M1 MBP, this works well for me. Get about ~7FPS more than the current main branch for recent game, still under 30fps but still playable.

@jcm93
Copy link
Contributor

jcm93 commented May 12, 2023

Thank you for posting this! It seems to stutter slightly more severely than macos1, but fixes many issues, and once the stutters level off, 30fps in TotK is very stable (on my M1 Max, at least). Appreciate it and look forward to the rest of the upstreaming.

@CestLucas
Copy link

CestLucas commented May 12, 2023 via email

@JampaUchoa
Copy link

JampaUchoa commented May 13, 2023

Got one crash using this PR in ToTK, could not isolate what caused it to crash, but here is the log, hope it helps.

Ryujinx_1.1.1+0_2023-05-12_17-56-21.log

@dagstuan
Copy link

This PR should probably be added to #4062?

@morceaudebois
Copy link

Tried for a few minutes and it seems to run a tiny bit smoother on M1 Pro. The depths are still about 20fps though and the ground glitches a little in that area.

Will keep testing and report anything interesting :)

@nmpt88
Copy link

nmpt88 commented May 20, 2023

Thank you for your efforts for Mac players, it has quite a significant efficiency boost on my M1 Pro, I wonder if you can fix the white spot issue that has been present on Mac, thanks again

@MCChristor
Copy link

The oldest Version is the fastest for me in TOTK. 10-15fps more then the newest.

@morceaudebois
Copy link

So I got rid of my shader cache (I had been using the same for like three weeks), and I can now run the game at a relatively stable 30fps pretty much everywhere, which finally makes the depths playable! 🎉

Apart from all the glitches, this is starting to be a pretty good experience!

@coyote0770
Copy link

DO NOT USE THIS PR IF YOU ARE ON WINDOWS OR LINUX!

There is a performance impact and there are bugs you might not experience otherwise. Please wait for calls for testing in future.

This PR brings MacOS ryu up to the level of performance seen with the macos1 build, but obviously still missing TFB and geo shaders, etc. As of right now, there may be more bugs than on the macos1 build.

I'm PRing this right now in this incomplete state just so people can use it to run recent games with better performance. That's about it, really.

As bugs are fixed, this PR will be filled out with a more detailed explanation of what it actually does.

Idk if this will help anyone or not, but i wanted to share my experience with this game on Ryujinx so far anyway.

Hardware: M1 MacBook Air (the binned 7core gpu version) I plug it into my display and have an old heatsink just sitting in the right spot on the bottom of the laptop to keep it from thermal throttling, which works surprisingly well without a fan even:

Been playing Zelda TOTK on original 1.0 version of ryujinx. I am also using "fps++30 mod" and the "VisualFixes 1008p_FXAA_OFF_AO_FIX" mod. Ran into major graphics issues in "the depths" part of the game. The floor was missing, weird angled walls that blocked any sort of view, etc. Found developer build 1.1.805 and was able to get through the depth level with out graphics issues. Updated to 1.1.808 and it seems that overall performance is reduced to around 17-23 fps. I stumbled across this thread, and tried this #4899 or Ryujinx 1.1.0+87139ac version.... and man, its working great. Solid 30fps with little to no drop and no glitches. Idk who is responsible for this build version but thank you x10. Running beautifully and smoothly.

@meritozh
Copy link

@coyote0770 just out of curiosity, do you have this issue? #4949

@coyote0770
Copy link

@coyote0770 just out of curiosity, do you have this issue? #4949

The white glowing stuff, yes...sometimes...The hair thing wiggin out in the menus, yes, unless Link is wearing a headband for whatever reason. The shadows of Link on the ground are messed up like that still, yes, I never even noticed it until recently when my gf pointed it out. I did have the chuchus glitch out weird on previous ryujinx versions but it seems okay now. Most of these things don't bother me as it doesn't affect gameplay. Im just happy to have it running and running smoothly on my base-tier mac :)

@riperiperi riperiperi force-pushed the buffer-mirrors-2023 branch 2 times, most recently from 46dfadd to 7c7b055 Compare May 22, 2023 18:03
@jcm93
Copy link
Contributor

jcm93 commented May 22, 2023

Looks like the latest rebase has significant issues with culling and actor updates.

@riperiperi
Copy link
Member Author

GPUs don't work in "culling and actor updates", can you be more specific?

@jcm93
Copy link
Contributor

jcm93 commented May 22, 2023

Sure, pardon the discord links, but here are some clips from the latest CI builds (7d970aa here):

https://cdn.discordapp.com/attachments/1045857828360949830/1110278475861987368/totk.mp4

https://cdn.discordapp.com/attachments/1045857828360949830/1110279509464338523/totk2.mp4

Neither of these issues were present in prior CI builds from this thread.

@meritozh
Copy link

test in Fire Emblem: Three Houses. have one strange rendering issue:

Screen.Recording.2023-05-23.at.23.44.04.mov

confirmed it not exist in release 1.1.819

Copy link
Member

@gdkchan gdkchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks! Really impressive the performance improvement this brings to some titles.

@tgcordell
Copy link

tgcordell commented Aug 12, 2023

Crash just occurred running the latest build - 1.1.0+e3fcd35. In TOTK, I was using a travel medallion for the first time and teleporting to a sky view tower. I believe it crashed during the loading screen during the teleport.

Log attached.

Switch Firmware: 16.0.3
Ryujinx_1.1.0+e3fcd35_2023-08-12_15-56-22.log

TOTK version: 1.1.0
MacOS: 13.5
Laptop: MacBook Pro 14”
CPU: M1 Pro
Ram: 16GB
Backend: Vulkan
Mods/Cheats/etc: None installed
All other system configs: default

Generally slower right now, goal is to reduce render passes in games that do inline updates

Fix support buffer mirrors

Reintroduce vertex buffer mirror

Add storage buffer support

Optimisation part 1

More optimisation

Avoid useless data copies.

Remove unused cbIndex stuff

Properly set write flag for storage buffers.

Fix minor issues

Not sure why this was here.

Fix BufferRangeList

Fix some big issues

Align storage buffers rather than getting full buffer as a range

Improves mirrorability of read-only storage buffers

Increase staging buffer size, as it now contains mirrors

Fix some issues with buffers not updating

Fix buffer SetDataUnchecked offset for one of the paths when using mirrors

Fix buffer mirrors interaction with buffer textures

Fix mirror rebinding

Move GetBuffer calls on indirect draws before BeginRenderPass to avoid draws without render pass

Fix mirrors rebase

Fix rebase 2023
Similar to `Get` with a size that's too large, just treat it as a clamp.
Enabled on all platforms for the IbStreamer.
Probably macos related?
@gdkchan gdkchan merged commit 492a046 into Ryujinx:master Aug 14, 2023
9 checks passed
@benjaminmordaunt
Copy link

benjaminmordaunt commented Aug 14, 2023

This seems to have introduced graphical issues in Pikmin 4 (MacBook Pro M1):

image

EDIT: This also now seems to be causing: "VM Fault hit memory shortage" when entering caves.

@solarmystic
Copy link

As I commented on ryujinx pr-testing discord, this PR seems to cause intermittent frametime spikes in SMO like clockwork after being merged into mainline 1.1.988. There were none on the previous mainline 1.1.987 before this PR got merged.

1.1.988 - 140 Average FPS/106 1% FPS/16 0.1% FPS (large frametime spike captured in screenshot below)
SMOza11988M 12700KF (527 56, -140 6mV) Ryujinx 1X Auto MT, All cores HT OFF, ALT Sch 3466 TUNED

1.1.987 - 139 Average FPS/105 1% FPS/99 0.1% FPS (smooth frametimes with no spikes whatsoever)
SMOza11987M 12700KF (527 56, -140 6mV) Ryujinx 1X Auto MT, All cores HT OFF, ALT Sch 3466 TUNED

jcm93 pushed a commit to jcm93/Ryujinx that referenced this pull request Aug 15, 2023
* Initial implementation of buffer mirrors

Generally slower right now, goal is to reduce render passes in games that do inline updates

Fix support buffer mirrors

Reintroduce vertex buffer mirror

Add storage buffer support

Optimisation part 1

More optimisation

Avoid useless data copies.

Remove unused cbIndex stuff

Properly set write flag for storage buffers.

Fix minor issues

Not sure why this was here.

Fix BufferRangeList

Fix some big issues

Align storage buffers rather than getting full buffer as a range

Improves mirrorability of read-only storage buffers

Increase staging buffer size, as it now contains mirrors

Fix some issues with buffers not updating

Fix buffer SetDataUnchecked offset for one of the paths when using mirrors

Fix buffer mirrors interaction with buffer textures

Fix mirror rebinding

Move GetBuffer calls on indirect draws before BeginRenderPass to avoid draws without render pass

Fix mirrors rebase

Fix rebase 2023

* Fix crash when using stale vertex buffer

Similar to `Get` with a size that's too large, just treat it as a clamp.

* Explicitly set support buffer as mirrorable

* Address feedback

* Remove unused fragment of MVK workaround

* Replace logging for staging buffer OOM

* Address format issues

* Address more format issues

* Mini cleanup

* Address more things

* Rename BufferRangeList

* Support bounding range for ClearMirrors and UploadPendingData

* Add maximum size for vertex buffer mirrors

* Enable index buffer mirrors

Enabled on all platforms for the IbStreamer.

* Feedback

* Remove mystery BufferCache change

Probably macos related?

* Fix mirrors not creating when staging buffer is empty.

* Change log level to debug
@gdkchan gdkchan mentioned this pull request Aug 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gpu Related to Ryujinx.Graphics graphics-backend:vulkan Graphical bugs when using the Vulkan API os:macOS An issue or feature request exclusively relating to macOS performance Performance issue or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet