Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rsx/spu: Performance optimizations and other improvements #3026

Merged
merged 9 commits into from
Jul 19, 2017

Conversation

kd-11
Copy link
Contributor

@kd-11 kd-11 commented Jul 17, 2017

Highlights

  1. RSX
  • Avoid aggressive resource create/delete cycles when using vulkan. Moderate speedup in some games and fixes flickering in some cases
  • Fix a vulkan crash about undeclared fog_c. Also fixes default parameter initialization that should fix some games that have broken graphics with vulkan but work fine with openGL
  • Fix an rsx crash when a null address is provided for the fragment shader location
  • Improved multithreaded vertex processing so that the penalty is much lower if there are no resources available. Its also tunable now allowing the threshold to be set
  1. SPU
  • AsmJit: Avoid aggressively locking the asmjit db and make compilation step multithreaded. It takes a much shorter time to compile a function than it does to wait on a lock especially with non-spurs-type kernels
  • AsmJit: Cache compiled functions and avoid calling database analyse function when not needed
  • Add loop condition detection by triggering an OS scheduler update on RdDec - this function can only be sensibly used as part of a loop. [Disabled by default for now]
  • Add concurrent execution analysis. This only affects spurs-type kernels where multiple threads are executing the exact same code at the exact same time. Introduces a small delay to racing threads so that they are effectively desynchronized whenever they enter a sensitive function. [Disabled by default]
  1. Other
  • CMake fix to not double-compile qrc_resources.cpp as it is a tmp generated file. Also adds it to list of files to clean

TODO before merge:

  • Reduce watchdog memory footprint by using the segment address as a base + 256K
  • Investigate tales of vesperia regression

@zminhquanz
Copy link

You need to fix random timing of the lower left corner of the note is not full of black texture error, and some 2d material can not be read on Project Diva F

@AniLeo
Copy link
Member

AniLeo commented Jul 17, 2017

Is that related to this PR at all? If so, what song?

@Lemiru
Copy link

Lemiru commented Jul 17, 2017

For Deception IV: The Nightmare Princess this PR fixed this crash on Vulkan:

E {rsx::thread} RSX: ERROR: 0:65: 'tc9' : undeclared identifier 
ERROR: 0:65: '' : compilation terminated 
ERROR: 2 compilation errors.  No code generated.


E {rsx::thread} RSX: 
F {rsx::thread} class std::runtime_error thrown: Failed to compile vertex shader
(in file C:\rpcs3\rpcs3\Emu\RSX\VK\VKVertexProgram.cpp:401)

@Xcedf
Copy link

Xcedf commented Jul 17, 2017

Assassin's Creed and Prince of Persia now go ingame
27
22
but Resident Evil Revelations now has flickering issues under OGL

@Xcedf
Copy link

Xcedf commented Jul 18, 2017

Resident Evil Revelations issue i've reported now fixed on latest rev
Thanks

@kd-11 kd-11 changed the title [WIP/Testing Needed] rsx/spu: Performance optimizations and other improvements rsx/spu: Performance optimizations and other improvements Jul 18, 2017
@Xcedf
Copy link

Xcedf commented Jul 18, 2017

After the last commit noticed that performance gains from SPU threads were killed, but reverting it gave me nothing, strange but reverting d7a9643 returns the speed, the problem didn't even existed before 86bd6b6
here's the difference
Heavy Rain 4 SPU Threads last commit
63
and with d7a9643 reverted
62
Same for RDR 5 SPU Threads
The last
64
and with d7a9643 reverted
65

@kd-11
Copy link
Contributor Author

kd-11 commented Jul 19, 2017

I know it's faster without the reader lock but I cant guarantee non msvc compilers like gcc and clang will not crash so I added it back for now. A better solution will be more stable. If building for yourself on windows you can comment out the reader lock

@kd-11
Copy link
Contributor Author

kd-11 commented Jul 19, 2017

Also spu threads does not work the way you think. Its either better performance with it at 1 or 2 or it works better disabled. If you're using 4 threads you will likely only benefit from loop detection

@Xcedf
Copy link

Xcedf commented Jul 19, 2017

Thanks for the info

const auto limit = std::min(max_size, func->size) >> 2;

bool failed = false;
for (u32 dword = 0; dword < limit; dword++)
Copy link
Contributor

@danilaml danilaml Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is better than simple memcmp? AFAIK it's usually vectorized as well (certainly on GCC/Clang).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. I'm undoing most of this commit actually. The compiled blocks are actually so small that its not worth it

@kd-11
Copy link
Contributor Author

kd-11 commented Jul 19, 2017

@Xcedf Performance issues due to the locks are resolved now in a much cleaner way. You may retest.

@Xcedf
Copy link

Xcedf commented Jul 19, 2017

@kd-11 Confirm. Performance problems resolved, things even slightly faster now

@Xcedf
Copy link

Xcedf commented Jul 19, 2017

not tested T6 for long time
it now hits fullspeed sometimes
avg fps 54-58 on this branch on many areas and characters
game is pretty much Playable now
57

@kd-11 kd-11 mentioned this pull request Jul 19, 2017
2 tasks
- Significant gains due to avoiding aggressive create-delete cycles every frame
- Delays threads by a predetermined amount to 'desync' spurs kernels.
  Largely reduces lock contention issues as well as making spurs kernels
  play nice with reservations
- Also reduces number of lost notifications (SPU_EVENT_LR)
- Improvements to framebuffer usage; Avoid creating new resources every frame
- Handle null fragment program properly
- Collect vertex upload statistics

- vk: Pre-initialize 'unused' varying registers in the vertex shader in case it gets matched with a fs that consumes it
 -- Fixes a crash about fog_c not being declared

gl/dx12/vk: Handle null fragment program

- cleanup - use yield semantic instead of sleep(0) as yield is more cross-platform
 -- sleep(0) is a windows specific scheduler hint
…n code

- spus run a tight gpu-style kernel with no multitasking on the cores themselves
-- this does not map well to PC processor cores because they never sleep even when doing nothing
-- the poll detection hack tries to find a good place to insert a scheduler yield
-- RdDec is a good spot as it signifies the spu kernel is waiting on a timer
- Properly handle data 'transfer' when recycling frame buffer images
- Clear 'recycled' surfaces before use
- Gets around the locking issues when fetching from the shared db
@zminhquanz
Copy link

It's improve perfomance on Project Diva F too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants