rasterizer_cache: Improve validation skip heuristic #69
Have you tested plugins with this change? Plugins use framebuffers in weird ways, so this kind of change may break them.
Merging, as there are no reported regressions so far and this fixes an issue, so it should be a net positive.
Luigi's Mansion: Dark Moon crashes at the beginning, right before Luigi starts being teleported to the first level. Async shader compilation is disabled. System: Windows 10 Pro x64
Need a log file with debug renderer 😄
How do I enable the debug renderer? It only happens in Vulkan, OpenGL is fine.
@DonelBueno
There you go, @gpucode
Luigi mansion 2 😋 VID_20240415_073805_869.mp4
Hello, does this update bring performance improvements with Mario 3D Land? And how can I activate debugging so that the emulator works well? I use Vulkan.
Which version of Turnip did you use, and how did you fix the audio issue?
Note 12 or 13? You can use a Turnip driver from K1MCH1: https://github.com/K11MCH1/AdrenoToolsDrivers/releases?page=2 Use Qualcomm 615.77, it will help you gain speed in emulation!
Most 3DS games are relatively well behaved with their VRAM usage, with few framebuffers that don't overlap with each other. However, since the system has a unified memory architecture (UMA), nothing stops games from using VRAM in weird ways. In some cases games will try to reuse VRAM by aliasing memory as entirely different textures. This isn't that terrible to handle if the stride stays the same, but that isn't always the case.
Luigi's Mansion: Dark Moon will initially use a memory region as a framebuffer with 256 pixel stride, then reuse the same region as a framebuffer with 128 pixel stride. The contents are immediately cleared afterwards so the result doesn't really matter, but this trips up the texture cache and causes a useless and expensive gpu flush per frame.
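To make the stride problem concrete, here is a minimal sketch of why the same bytes mean different things under a 256 and a 128 pixel stride. It assumes a plain linear layout and ignores the tiling the 3DS actually uses; `PixelCoord` and `CoordFromIndex` are hypothetical names for illustration, not Citra code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of Citra.
struct PixelCoord {
    std::uint32_t x;
    std::uint32_t y;
};

// Maps a linear pixel index inside a memory region to (x, y) for a given
// row stride in pixels. Assumes a linear layout (no tiling/swizzling).
PixelCoord CoordFromIndex(std::uint32_t pixel_index, std::uint32_t stride) {
    assert(stride != 0);
    return PixelCoord{pixel_index % stride, pixel_index / stride};
}
```

With this mapping, pixel index 300 lands on row 1 with a 256 pixel stride but on row 2 with a 128 pixel stride, so a cached surface with the old stride cannot be reused with a simple rectangular copy.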
Kid Icarus also does this. Through the texture cache it shows as various reinterpretations between D24S8 and RGBA8 framebuffers with varying strides. Paper Mario: Sticker Star uses this trick, presumably to render the bottom screen. It starts out with 2 color/depth framebuffers with 256 pixel strides and will then reinterpret them with 128 pixel strides, instead of reusing the top framebuffer with a smaller viewport, causing 2 texture flushes per frame.
Arguably the worst case of this aliasing is Spider-Man Edge of Time. I haven't measured how many flushes it causes, but it's probably more than 3, and all of them slow the game down to a crawl, making it unplayable.
Citra isn't entirely helpless on that front however. The current heuristic will skip validation if part of the interval is owned by a gpu invalid surface and there is a fill surface overlapping that region:
citra/src/video_core/rasterizer_cache/rasterizer_cache.h
Line 1185 in f5cf180
However this heuristic is kinda busted and just happens to work by luck from what I've seen. So in this PR, I have tweaked the heuristic to consider texture strides as well. If the region is partially owned by a gpu invalidated surface that doesn't have the same stride, validation is skipped. This covers a lot of games, but not all (see the comment). In the near future I want to rewrite the validation portion of the code to fully handle texture flushes in the gpu which will allow arbitrary transforms to occur without round-tripping to the cpu.
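As a rough illustration of the stride check described above (not the actual code in `rasterizer_cache.h`), the decision could be sketched as follows. `SurfaceInfo`, `Interval`, `Overlaps` and `ShouldSkipValidation` are hypothetical names, and treating any overlap as "partially owned" is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified surface record; the real cache tracks much more state.
struct SurfaceInfo {
    std::uint32_t addr;    // start address of the surface in VRAM
    std::uint32_t size;    // size in bytes
    std::uint32_t stride;  // row stride in pixels
    bool gpu_invalidated;  // true if the GPU copy of this surface is stale
};

// Half-open address range [start, end).
struct Interval {
    std::uint32_t start;
    std::uint32_t end;
};

// Assumption: a surface "partially owns" the interval if the two ranges overlap.
bool Overlaps(const SurfaceInfo& s, const Interval& iv) {
    return s.addr < iv.end && iv.start < s.addr + s.size;
}

// Returns true if validation of `iv` for a surface with `target_stride` can be
// skipped: some GPU-invalidated cached surface overlaps it with another stride.
bool ShouldSkipValidation(const Interval& iv, std::uint32_t target_stride,
                          const std::vector<SurfaceInfo>& cached) {
    assert(iv.start < iv.end);
    for (const SurfaceInfo& s : cached) {
        if (s.gpu_invalidated && s.stride != target_stride && Overlaps(s, iv)) {
            return true;
        }
    }
    return false;
}
```

In this sketch a stale 256 pixel stride surface overlapping a new 128 pixel stride request skips validation, while a matching stride or a still-valid surface falls through to the normal path.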
(Without going into too much detail, the current validation routine in the texture cache suffers from high overhead and a lack of flexibility. It tries to find specific surfaces and, if it fails, it doesn't consider alternative ways of salvaging gpu data that could otherwise be very usable. A better approach would be to treat validation as a memory operation, more like an image copy, with image copies being an optimization for validations that satisfy rectangular bounds.)
All in all, this results in a 2x to 4x performance improvement in the affected games. The tests below were carried out on my AMD iGPU at 4x resolution to simulate a more GPU bound scenario.