rsx: Optimizations #6112

Merged
merged 18 commits into from
Jun 25, 2019

kd-11
Contributor

@kd-11 kd-11 commented Jun 20, 2019

This PR is a collection of several optimizations done over the past week or so. Highlights:

  • Restructure the GLSL emitter to produce more optimal code. The generated SPIR-V consumes fewer VGPRs, which improves CU occupancy. The improvement here is staggering in a kernel-to-kernel comparison, up to 2x, but unless you're running a really low-end GPU, this will likely have no impact.
  • Avoid issuing memory barriers when not needed under Vulkan. The old code was a little overzealous in trying to appease the spec. While this is only the first step, it results in a >30% reduction in cache invalidations on Polaris. As with the previous item, this will likely mostly affect lower-end systems.
  • Accelerate index buffer processing using SSSE3. Potentially up to 4x faster for 32-bit uploads and 8x faster for 16-bit uploads. This results in a tangible performance increase of around 10%.
  • Set up a base framework for RSX multithreading, using an offloader thread to handle PCI-e transfers in parallel with FIFO processing. Currently only useful in some games, where a decent performance uplift can be observed (5-10%). Note that this option can actually hurt framerates in some games and on some processors. More work on this will be done at a later time.
  • Reimplement vertex layout handling under Vulkan as a streaming model, which improves performance by avoiding a costly descriptor copy on every sub-block of a batched draw. Fixes issues on Radeon cards where performance could drop into seconds-per-frame territory in some games.
  • Improve the debugging profiler for RSX to avoid spamming QPC when debug statistics are not enabled. Surprisingly, this resulted in a decent performance uplift.
  • Rewrite ZCULL job dispatch for Vulkan. This changes the model from one-entry-per-draw to one-entry-per-command-buffer, which is much more efficient in games that use ZCULL heavily. This lowers the ZCULL penalty significantly in many titles.
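As a rough illustration of the SSSE3 item above, here is a minimal sketch that byte-swaps 16-bit indices eight at a time with `_mm_shuffle_epi8`. It assumes the index work is dominated by converting big-endian indices to little-endian, which the PR text does not state. The function name `swap_indices_u16` is hypothetical and is not taken from the RPCS3 source. The SIMD path is only compiled when the compiler defines `__SSSE3__`; otherwise the scalar loop handles everything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSSE3__)
#include <tmmintrin.h> // _mm_shuffle_epi8 (SSSE3)
#endif

// Hypothetical helper, not RPCS3's code: copy 16-bit indices from a
// big-endian source buffer into a little-endian destination buffer.
inline void swap_indices_u16(const std::uint16_t* src, std::uint16_t* dst, std::size_t count)
{
    assert(count == 0 || (src && dst));

    std::size_t i = 0;
#if defined(__SSSE3__)
    // Shuffle mask that exchanges the two bytes of every 16-bit lane.
    const __m128i mask = _mm_set_epi8(14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1);
    for (; i + 8 <= count; i += 8)
    {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_shuffle_epi8(v, mask));
    }
#endif
    // Scalar tail, and the full fallback when SSSE3 is not enabled at compile time.
    for (; i < count; ++i)
    {
        dst[i] = static_cast<std::uint16_t>((src[i] << 8) | (src[i] >> 8));
    }
}
```

One shuffle handles eight 16-bit indices per iteration, so the scalar loop only runs for the remainder. The actual speedup depends on memory bandwidth and on whatever other per-index work is done.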

Other miscellaneous improvements:

  • Fixed a typo that cost us performance due to a broken comparison always failing.
  • Improve Vulkan driver detection by using the KHR_driver_properties extension instead of relying on GPU names.
  • Change the debug layer name to KHRONOS_validation from LUNARG_standard_validation. The latter has been deprecated and superseded by the former.
  • Minor optimization for OpenGL to avoid spamming glSamplerParameter calls over and over when unnecessary.
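The OpenGL sampler item above amounts to caching the last value set and skipping redundant driver calls. A minimal sketch of that idea follows. The class `sampler_param_cache` and its callback are hypothetical, and the callback stands in for a real call such as `glSamplerParameteri(sampler, pname, value)` so that the example runs without a GL context. This is not the code from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical cache, not RPCS3's class. `apply` stands in for a driver call
// such as glSamplerParameteri(sampler, pname, value).
class sampler_param_cache
{
public:
    using apply_fn = std::function<void(std::uint32_t pname, std::int32_t value)>;

    explicit sampler_param_cache(apply_fn apply) : m_apply(std::move(apply))
    {
        assert(m_apply);
    }

    // Forwards to the driver only when the value differs from the last one set.
    // Returns true if a driver call was issued.
    bool set(std::uint32_t pname, std::int32_t value)
    {
        const auto found = m_state.find(pname);
        if (found != m_state.end() && found->second == value)
        {
            return false; // unchanged, skip the driver call
        }

        m_state[pname] = value;
        m_apply(pname, value);
        return true;
    }

private:
    apply_fn m_apply;
    std::unordered_map<std::uint32_t, std::int32_t> m_state;
};
```

The trade-off is a small lookup on the CPU side in exchange for fewer driver calls, which pays off when the same parameters are set every draw.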

Overall, the performance uplift on Vulkan is quite significant, but unfortunately most games may not see the full benefit because the performance bottleneck simply moves to other units of the emulator. In purely PPU+RSX workloads, the performance jump is anywhere from 20-50% depending on the system. In some games the framerate may not change much, but the working time spent by RSX (RSX load) goes down significantly.

NOTE: If you experience flickering visuals with Multithreaded RSX option enabled, just don't use it. It is a sign that the extra thread is being unscheduled frequently by your OS and is not able to keep up; something which defeats the purpose of an offloader thread. More tuning on this will be done later. This option defaults to OFF as it can hurt performance in some games even on high core count CPUs.

@Whatcookie
Member

Getting just a black screen on radv 19.1 on my rx 570

amdvlk seems to work fine

@kd-11
Contributor Author

kd-11 commented Jun 20, 2019

Probably depends on what you're running but I can check in a few seconds.

@kd-11
Contributor Author

kd-11 commented Jun 20, 2019

Confirmed issue on RADV drivers. It will get fixed before merge.

@jobs-git

jobs-git commented Jun 20, 2019

Massive GPU Usage Improvement in GOW3! From 37-40% GPU (last rpcs3) to 13% GPU in this PR! FPS is still low but I guess that came from a different PR

Params:

GOW3, GTX1060 Vulkan SPU-LLVM PPU-LLVM Multithread RSX ON

@MSuih
Member

MSuih commented Jun 20, 2019

@Snakegodeater I don't want to be rude, but this pull request is going to end up getting a lot of replies, which makes it hard to follow. Asking unrelated questions does not help at all; it would be better to ask this in, for example, the Discord chat instead of here.

@kd-11
Contributor Author

kd-11 commented Jun 20, 2019

The RADV issue seems to be a driver bug relating to the texelFetch instruction. I'll open a bug report with them.

@Yahfz
Contributor

Yahfz commented Jun 20, 2019

i7-8700K 5.2GHz, Multithread RSX off.

Persona 5

Master
Image1
PR
Image2

inFamous

Master
Image1
PR
Image2

@MsDarkLow
Contributor

i5-7300HQ, Multithread RSX off.

Yakuza Dead Souls

Master (~20fps)
MDeadSouls

PR (~24fps)
PRDeadSouls

Ratchet and Clank® Future: Tools of Destruction™

Master (~10 fps)
MToD

PR (~12 fps)
PRToD

@marcin-przywoski

In R&C2 I'm seeing around 10 FPS improvement in the most demanding areas. i9 8950HK 5 GHz and 1080

@Whatcookie
Member

Whatcookie commented Jun 20, 2019

With 7700K 5GHz

Wipeout HD demo (MT RSX off)

Master

WipeoutHDmaster

PR

WipeoutHD

NGS MT RSX comparison. This is the most intense scene in the game (I tried to match them up, but it's difficult).

Master

NGSMaster

PR MT RSX off

NGSMTrsxoff

PR MT RSX on

NGSMTrsx

It looks like I need to find a new game with RSX bottleneck to benchmark, it runs too well with recent improvements.

MT RSX averages 1-2% less guest utilization in all scenes.

Edit: it's not intended that the MT RSX thread appears as extra load on the PPU threads, is it?

Edit 2: As the MT RSX thread gets loaded, the overlay shows lower RSX thread usage while PPU usage increases.
overlaybug
overlaybug2

@jenci8888

jenci8888 commented Jun 20, 2019

No longer hangs on a semaphore after buying a car in Gran Turismo 6, though the black vehicle issue remains. (kd-11 said "This will fix later.")
Before: 0.4 fps, after: ~17 fps (not related to gameplay; this is the car purchase animation).
Also, I'm experiencing flickering on tracks with "Dynamic weather and sky" when Multithreaded RSX is enabled. (Ignore this; kd-11 already addressed it.)
Gameplay fps doesn't improve much either (a marginal improvement of ~1-2%).
image

@digitaldude555

Resistance 3 haven demo doesn't go in-game anymore.

@lex3a

lex3a commented Jun 20, 2019

i7-3770K, Multithread RSX off.

Tekken Tag Tournament HD

Master:
master_heli
PR:
opti_heli

@cesarnox

cesarnox commented Jun 21, 2019

Amazing! I'm getting like 50% fps improvement in Persona 5

I'm using i5 8400 integrated HD 630 vulkan

Multithread RSX off

@greentop

i7-3770, Nvidia 1030 (418.56), Vulkan renderer, Multithread RSX off, Ubuntu 18.04

P5 (Yongen-Jaya):
Master: 21-23 FPS
PR as of c33cf4a: 23-25 FPS

kd-11 added 14 commits June 25, 2019 20:11
- When multithreaded RSX is enabled, the vertex cache just lowers performance
- The small cost of upload is paid by the asynchronous thread, allowing RSX to work optimally
- Remove string comparisons from the hot-path!
- Use attribute streaming and push constants to avoid forcing a descriptor block copy every other draw call/pass.
  While this isn't so bad on nvidia cards, it makes AMD cards a slideshow.
- Avoid spamming the driver with samplerParameter calls unless the parameters have actually changed
- Use a lockless queue
- Do not enqueue small transfers
- Avoid spamming QPC when not needed
- Free performance when debug overlay is not enabled
- Typo fix
- This check leads to forever relocating memory if size never exceeds capacity!
- Do not consume a slot every draw call, instead batch as many draws as possible
- Since renderpasses are dispatched per-draw-clause, keeping occlusion queries outside the renderpasses works fine
- If renderpasses are reorganized, occlusion tasks will have to be reorganized again
- Use two counters to avoid atomic operations
- Yield instead of sleeping because some games are very sensitive to timing
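The commit notes above mention a lockless queue driven by two counters. A minimal sketch of one common way to build such a queue for a single producer and a single consumer follows. The class `spsc_queue` is hypothetical and is not the code from this PR. Note one deliberate difference: the commit note says "avoid atomic operations", whereas this sketch uses `std::atomic` loads and stores with acquire/release ordering to stay data-race-free in standard C++. It avoids locks and atomic read-modify-write operations, not atomic types altogether.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring buffer, not RPCS3's code.
// One counter is written only by the producer, the other only by the consumer.
template <typename T, std::size_t N>
class spsc_queue
{
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Producer side. Returns false when the queue is full.
    bool push(const T& value)
    {
        const std::size_t w = m_write.load(std::memory_order_relaxed);
        const std::size_t r = m_read.load(std::memory_order_acquire);
        assert(w - r <= N);
        if (w - r == N)
        {
            return false; // full
        }

        m_slots[w & (N - 1)] = value;
        m_write.store(w + 1, std::memory_order_release); // publish the slot
        return true;
    }

    // Consumer side. Returns false when the queue is empty.
    bool pop(T& out)
    {
        const std::size_t r = m_read.load(std::memory_order_relaxed);
        if (r == m_write.load(std::memory_order_acquire))
        {
            return false; // empty
        }

        out = m_slots[r & (N - 1)];
        m_read.store(r + 1, std::memory_order_release); // release the slot
        return true;
    }

private:
    std::array<T, N> m_slots{};
    std::atomic<std::size_t> m_write{0}; // advanced only by the producer
    std::atomic<std::size_t> m_read{0};  // advanced only by the consumer
};
```

Because each counter has exactly one writer, neither side needs compare-and-swap or a mutex. A caller could also bypass the queue entirely for small transfers and copy them inline, in the spirit of the "Do not enqueue small transfers" note, though any threshold would have to be tuned.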
@kd-11
Contributor Author

kd-11 commented Jun 25, 2019

Wipeout HD has a serious problem but not from this PR; I'll investigate that as a separate issue. On my setup I get verification failed error when WCB is enabled which is different from what is reported here and is likely unrelated. FFXIII crash does not originate in RSX code and is not in the VRAM partition either. My guess would be race conditions causing problems now that RSX is running "too fast".

@kd-11 kd-11 merged commit 9ce7b8a into RPCS3:master Jun 25, 2019
@spyropt

spyropt commented Jun 25, 2019

You are right, it's the WCB. I tested, and without WCB it works.
I don't get the verification failed error; it just hangs.

on master before this pr
image
RPCS3.log.gz

@JasonFre

Just noting that after Yahfz's comments on inFamous I tested it on my PC with this revision and I didn't get anywhere near the FPS he got. For me it ranged from 5 to 25, jumping wildly, with major audio issues going on.

@legend800

I can confirm that Wipeout HD runs fine with WCB on the build before this (tried multiple times), and now it hangs on the loading screen 100% of the time, before the track renders.

So while the game has other issues, this PR really exacerbated the problem I guess and now it doesn't work anymore. Previously, I had this game marked as "fully playable" given there's no issues with it aside from the occasional lag. Let me know if you want a tracking issue for this.

Build before (fine):
RPCS3.log - before.gz

Build after this was merged (broken):
RPCS3 - optimizations pr.zip

@Asinin3
Contributor

Asinin3 commented Jun 26, 2019

@JasonFre Yahfz has an 8700K overclocked to 5.3GHz, good luck beating his FPS. If you're not on a system that is close to that powerful, then of course you won't reach the same performance levels. You should really list your hardware/settings, though. And if it's just because your hardware isn't strong enough or your settings are wrong, then it belongs on Discord. https://discord.me/RPCS3
