Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise primitive_restart::upload_untouched() #6881

Merged

Conversation

linkmauve
Copy link
Contributor

@linkmauve linkmauve commented Oct 26, 2019

This optimisation targets SSE4.1 and is only applied when skip_restart is false.

I’ve only tested the u16 codepath, as it is the one used in NieR.

In some very unscientific profiling, this function used to take 2.76% of the total time at some spot of the port town, it now takes about 0.40% on SSE4.1, and 0.30% on AVX2.

@Illynir
Copy link

Illynir commented Oct 26, 2019

I tested, +3/4 FPS on Nier Replicant/NieR on my test spot (61 FPS -> 64,5 FPS).
There will probably be other games affected but we have to find them.

Nice. :)

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch 3 times, most recently from 42024e3 to e255c81 Compare October 26, 2019 08:29
@linkmauve
Copy link
Contributor Author

It seems your Linux CI isn’t configured to allow AVX2, so for now I added #if __AVX2__ guards around this code, the SSE4.1 speedup is already nice on its own.

rpcs3/Emu/RSX/Common/BufferUtils.cpp Outdated Show resolved Hide resolved
@Illynir
Copy link

Illynir commented Oct 26, 2019

I tested a few other games besides Nier to see the impact elsewhere.

  • 1 FPS on Uncharted
  • 2 FPS on Yakuza 3
  • 1 FPS on Wipeout HD Fury
  • 2 FPS on Dragon quest Builders

We could say that it is within the margin of calculation error but it is very consistent so the gain is real. I found no regression for the moment.

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch 3 times, most recently from 69b47a2 to 84bfd3b Compare October 27, 2019 23:49
@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch from 84bfd3b to 215e381 Compare October 28, 2019 12:42
@kd-11
Copy link
Contributor

kd-11 commented Oct 28, 2019

I feel like we're not communicating effectively. We don't need the gcc attributes check in this case. Only local users build with march=native, appimages build with march=sse2 always. In fact, appimages don't even build with gcc at all, we use clang for the official builds. With that being said, you find a curious difference between GCC and Clang; clang does not allow the method definitions to even exist if the march level is below it; i.e if march is SSE2 all SSE3 intrinsics will be undefined methods and throw errors, hence the inline asm workarounds.

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch from 215e381 to 6245ac5 Compare October 28, 2019 16:25
@linkmauve
Copy link
Contributor Author

It actually was a typo (I added a semicolon after the __attribute__ in the previous revision…) that prevented both clang and gcc from using the AVX2 version.

With this attribute in place now, I believe we build both versions on all three OSes, and select the correct one at runtime.

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch from 6245ac5 to e21dc87 Compare October 28, 2019 17:25
@linkmauve
Copy link
Contributor Author

There, I believe I fixed all of your comments. I also split the commits so that they are easier to understand separately.

@linkmauve linkmauve requested a review from kd-11 October 28, 2019 17:28
@kd-11
Copy link
Contributor

kd-11 commented Oct 28, 2019

Well, the target attribute doesn't work the way you want it to, but it got clang to compile AVX and SSE4.1 code without much problem which is a win. Multiversioning usually works if there is a fallback overload with the same function name marked with a "default" target. Its not important for rpcs3 anyway and the proposed solution does generate the assembly output I wanted so it should be good. I'll have to check the u32 paths though since they're untested.

@linkmauve
Copy link
Contributor Author

Do you have a stash of tests btw, so that I could verify I’m not regressing things in the future?

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch from e21dc87 to 62f91a8 Compare October 28, 2019 18:55
@kd-11
Copy link
Contributor

kd-11 commented Oct 28, 2019

Yes, I have a stash of tests, but not for this particular situation. I do provide them and reference data if you happen to touch something that I have hardware tests for. I should probably put them somewhere easily accessible.

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch from 62f91a8 to fb33c18 Compare October 28, 2019 23:00
@linkmauve linkmauve requested a review from kd-11 October 28, 2019 23:04
@linkmauve
Copy link
Contributor Author

I haven’t been able to get anything but a black screen out of this rrc file, so not yet. Do you have any idea what could cause it? Also, is there a simpler way than drag and drop to open this file? I couldn’t find any menu or command line way. ^^'

@kd-11
Copy link
Contributor

kd-11 commented Oct 29, 2019

You need to enable the debug tab in your GuiConfig.ini file inside the GuiConfigs folder in your config folder. I think on linux its ~/.config/rpcs3/GuiConfigs. Set ShowDebugTab to true and a new Utilities menu will appear where you can run rrc files from. Note that RRC files will execute using your global configuration, so make sure to set up correctly. Disable asynchronous shaders when loading RRC files to avoid possible corruption.

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch 2 times, most recently from ea7859a to ff423f7 Compare October 29, 2019 12:20
@linkmauve
Copy link
Contributor Author

Thanks, MGS4 is now fixed, the issue was that the indices remaining after the SIMD version ran were written in the wrong order, from last to first…

This trace also let me find this bug using anv: #6916.

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch from ff423f7 to b4327a8 Compare October 29, 2019 12:32
@kd-11
Copy link
Contributor

kd-11 commented Oct 29, 2019

This will break bisect since we now have a sequence of commits that are all broken fixed by a commit at the end. I'll just squash them all during merge to avoid this, but for future reference it would have been nice to not have to.

@linkmauve
Copy link
Contributor Author

linkmauve commented Oct 29, 2019

I can reorder them, no worries. That’s much better than squashing.

@illusion0001
Copy link
Contributor

Typo in commit description
rsx: Write the ibo tail in the correct order
Fixes MSG4’s graveyard scene.

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch 2 times, most recently from 71f4666 to ae3c188 Compare October 29, 2019 13:34
@linkmauve
Copy link
Contributor Author

linkmauve commented Oct 29, 2019

There, should be fine for merging now. :)

I did test all four commits for no regression on both NieR and this MGS4 rrc.

@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch from ae3c188 to 031c993 Compare October 29, 2019 14:22
Copy link
Contributor

@kd-11 kd-11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to leave this up for a day or so to allow for more user testing. Code looks ok at a glance.

This optimisation is only applied when skip_restart is false.

I’ve only tested the u16 codepath, as it is the one used in NieR.

In some very unscientific profiling, this function used to take 2.76% of
the total frame time at the save point of the port town, it now takes
about 0.40%.
This assures the compiler we will take care of only calling these
functions after having checked that the CPU does support these
instructions.
Now that clang is aware that our functions are compiled with SSE4.1, it
lets us generate this code using its intrinsics.
This is done using minpos and srli intrinsics and generate less code
than before.

Thanks Nekotekina for the suggestion!
@linkmauve linkmauve force-pushed the optimise-primitive_restart_upload_untouched branch from 031c993 to 12e624e Compare October 29, 2019 22:23
@linkmauve
Copy link
Contributor Author

linkmauve commented Oct 29, 2019

Optimisation added, even though it only gets called once per copy so it’s not very hot. I also couldn’t find any information in the Intel manuals on the latency and throughput of vphminposuw, but I assume the sheer amount of instructions removed is worth it.

@isJuhn
Copy link
Contributor

isJuhn commented Oct 29, 2019

I use Agner's instruction tables as a reference for that kind of stuff. It includes cycles, latency and more for lots of x86 CPUs.

@Nekotekina Nekotekina merged commit cfd5cf6 into RPCS3:master Oct 30, 2019
@linkmauve linkmauve deleted the optimise-primitive_restart_upload_untouched branch October 30, 2019 15:39
@linkmauve
Copy link
Contributor Author

Why did you squash anyway in the end?

@Nekotekina
Copy link
Member

I'm sorry, didn't realize the communication regarding it

@linkmauve
Copy link
Contributor Author

No worries, I was just surprised. :)

kd-11 pushed a commit to kd-11/rpcs3 that referenced this pull request Nov 2, 2019
* rsx: Optimise primitive_restart::upload_untouched() with SSE4.1

This optimisation is only applied when skip_restart is false.

I’ve only tested the u16 codepath, as it is the one used in NieR.

In some very unscientific profiling, this function used to take 2.76% of
the total frame time at the save point of the port town, it now takes
about 0.40%.

* rsx: Mark all SSE4.1 functions with attributes on gcc and clang

This assures the compiler we will take care of only calling these
functions after having checked that the CPU does support these
instructions.

* rsx: Add an AVX2 implementation of primitive restart ibo upload

* rsx: Remove redefinition of SSE4.1 instructions

Now that clang is aware that our functions are compiled with SSE4.1, it
lets us generate this code using its intrinsics.

* rsx: Optimise vector to scalar conversion

This is done using minpos and srli intrinsics and generate less code
than before.

Thanks Nekotekina for the suggestion!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants