GSVertexTrace::FindMinMax improvements #3940

Merged (4 commits) on Nov 3, 2021

Conversation

TellowKrinkle
Member

@TellowKrinkle TellowKrinkle commented Nov 21, 2020

when you spend a bunch of time optimizing a function because you think it accounts for 30% of runtime but that was actually due to the compiler being stupid and not issues with the algorithm

This PR contains the following changes:

  • Prevents clang from optimizing out our denormal-removal shuffles (10x faster than before for people who compile with clang!)
  • Run divides on four elements at a time instead of two elements and two useless numbers
  • Run per-vertex instead of per-primitive for everything that has to apply to all vertices of a primitive anyway (this avoids double-checking vertices in triangle strips and reduces pointer chases in the main loop). Update: this broke stuff.
  • Remove inaccurate stq
    • With the above division improvements, on processors with partially pipelined division (Ivy Bridge and later, Bulldozer and later), accurate stq is actually faster than inaccurate stq, according to both IACA and LLVM MCA. On older CPUs, expect performance to be about 2/3 of the old algorithm, before taking into account the improvement from not double-checking vertices.
    • There seems to be a check of accurate_stq when the OGL backend decides whether to use geometry shaders to process sprites. Does anyone know if that's important, or is it just something someone threw in there that's okay to remove? Update: reason found, added as a comment.
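
As a rough illustration of the "four elements at a time" divide described in the list above, here is a minimal sketch using SSE intrinsics. It is not the code from this PR: the `TexCoord` and `UV` structs and the `DivideStByQ` function are hypothetical names invented for the example, and the real GSVertexTrace code uses PCSX2's own vector types. The idea is simply that S and T for two vertices are packed into one register so that all four lanes of the divide do useful work, instead of dividing one vertex's S and T and wasting the other two lanes.

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86 only)

// Hypothetical texture-coordinate layout for illustration only;
// this is not the vertex structure used by PCSX2.
struct TexCoord
{
    float s, t, q;
};

// Hypothetical output type holding the perspective-divided coordinates.
struct UV
{
    float u, v;
};

// Divide S and T by Q for two vertices with a single four-lane divide.
// Lane layout: [s0, t0, s1, t1] / [q0, q0, q1, q1]
inline void DivideStByQ(const TexCoord& a, const TexCoord& b, UV& outA, UV& outB)
{
    const __m128 st = _mm_setr_ps(a.s, a.t, b.s, b.t);
    const __m128 q = _mm_setr_ps(a.q, a.q, b.q, b.q);
    const __m128 uv = _mm_div_ps(st, q); // one divide, four useful results

    alignas(16) float lanes[4];
    _mm_store_ps(lanes, uv);
    outA = {lanes[0], lanes[1]};
    outB = {lanes[2], lanes[3]};
}
```

A packed divide costs roughly the same whether its lanes hold useful values or padding, so filling all four lanes approximately halves the number of divide instructions needed per vertex.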

In the end, ignoring the clang issues, GSVertexTrace::FindMinMax goes from taking about 3% of MTGS thread runtime to 1.5% on my computer. (Most of the time was spent doing OpenGL things, so if you have a more efficient OpenGL driver it might make more of a difference for you.)


@TellowKrinkle
Member Author

> This brings the slowdowns on Xenosaga Episode 1 during the third cut scene from ~40-45 fps to ~50-55 fps when using clang11.

Curious, after this patch, are clang builds slower than gcc builds here?

@frymezim

frymezim commented Dec 6, 2020

@iMineLink or @gregory38 might be able to help review this pull request.


@lightningterror
Contributor

lightningterror commented Apr 20, 2021

I'd add the prefix gsdx-ogl: to the newest commit.

@lightningterror
Contributor

Would be nice to rename GSdx to GS in commit names as GS was merged.

@TellowKrinkle TellowKrinkle force-pushed the FindMinMax branch 2 times, most recently from 21ed40b to 2a51f2e Compare October 18, 2021 01:58
TellowKrinkle and others added 3 commits October 19, 2021 21:03
Why repeat things when you can make the compiler repeat them for you
They were the same speed or slower than full div on IvyBridge+ and Bulldozer+
@refractionpcsx2
Member

refractionpcsx2 commented Oct 20, 2021

Been messing with this (while reimplementing the CLAMP optimisation), it looks good to me, everything seems to work good :)

Edit: There is a bug, will tell you about it on discord

Edit2: Nope, it was VS screwing me over, the PR is fine.
