New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise primitive_restart::upload_untouched() #6881
Optimise primitive_restart::upload_untouched() #6881
Conversation
I tested, +3/4 FPS on Nier Replicant/NieR on my test spot (61 FPS -> 64,5 FPS). Nice. :) |
42024e3
to
e255c81
Compare
It seems your Linux CI isn’t configured to allow AVX2, so for now I added |
I tested a few other games besides Nier to see the impact elsewhere.
We could say that it is within the margin of calculation error but it is very consistent so the gain is real. I found no regression for the moment. |
69b47a2
to
84bfd3b
Compare
84bfd3b
to
215e381
Compare
I feel like we're not communicating effectively. We don't need the gcc attributes check in this case. Only local users build with march=native, appimages build with march=sse2 always. In fact, appimages don't even build with gcc at all, we use clang for the official builds. With that being said, you find a curious difference between GCC and Clang; clang does not allow the method definitions to even exist if the march level is below it; i.e if march is SSE2 all SSE3 intrinsics will be undefined methods and throw errors, hence the inline asm workarounds. |
215e381
to
6245ac5
Compare
It actually was a typo (I added a semicolon after the With this attribute in place now, I believe we build both versions on all three OSes, and select the correct one at runtime. |
6245ac5
to
e21dc87
Compare
There, I believe I fixed all of your comments. I also split the commits so that they are easier to understand separately. |
Well, the target attribute doesn't work the way you want it to, but it got clang to compile AVX and SSE4.1 code without much problem which is a win. Multiversioning usually works if there is a fallback overload with the same function name marked with a "default" target. Its not important for rpcs3 anyway and the proposed solution does generate the assembly output I wanted so it should be good. I'll have to check the u32 paths though since they're untested. |
Do you have a stash of tests btw, so that I could verify I’m not regressing things in the future? |
e21dc87
to
62f91a8
Compare
Yes, I have a stash of tests, but not for this particular situation. I do provide them and reference data if you happen to touch something that I have hardware tests for. I should probably put them somewhere easily accessible. |
62f91a8
to
fb33c18
Compare
I haven’t been able to get anything but a black screen out of this rrc file, so not yet. Do you have any idea what could cause it? Also, is there a simpler way than drag and drop to open this file? I couldn’t find any menu or command line way. ^^' |
You need to enable the debug tab in your GuiConfig.ini file inside the GuiConfigs folder in your config folder. I think on linux its ~/.config/rpcs3/GuiConfigs. Set ShowDebugTab to true and a new Utilities menu will appear where you can run rrc files from. Note that RRC files will execute using your global configuration, so make sure to set up correctly. Disable asynchronous shaders when loading RRC files to avoid possible corruption. |
ea7859a
to
ff423f7
Compare
Thanks, MGS4 is now fixed, the issue was that the indices remaining after the SIMD version ran were written in the wrong order, from last to first… This trace also let me find this bug using anv: #6916. |
ff423f7
to
b4327a8
Compare
This will break bisect since we now have a sequence of commits that are all broken fixed by a commit at the end. I'll just squash them all during merge to avoid this, but for future reference it would have been nice to not have to. |
I can reorder them, no worries. That’s much better than squashing. |
Typo in commit description |
71f4666
to
ae3c188
Compare
There, should be fine for merging now. :) I did test all four commits for no regression on both NieR and this MGS4 rrc. |
ae3c188
to
031c993
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to leave this up for a day or so to allow for more user testing. Code looks ok at a glance.
This optimisation is only applied when skip_restart is false. I’ve only tested the u16 codepath, as it is the one used in NieR. In some very unscientific profiling, this function used to take 2.76% of the total frame time at the save point of the port town, it now takes about 0.40%.
This assures the compiler we will take care of only calling these functions after having checked that the CPU does support these instructions.
Now that clang is aware that our functions are compiled with SSE4.1, it lets us generate this code using its intrinsics.
This is done using minpos and srli intrinsics and generate less code than before. Thanks Nekotekina for the suggestion!
031c993
to
12e624e
Compare
Optimisation added, even though it only gets called once per copy so it’s not very hot. I also couldn’t find any information in the Intel manuals on the latency and throughput of |
I use Agner's instruction tables as a reference for that kind of stuff. It includes cycles, latency and more for lots of x86 CPUs. |
Why did you squash anyway in the end? |
I'm sorry, didn't realize the communication regarding it |
No worries, I was just surprised. :) |
* rsx: Optimise primitive_restart::upload_untouched() with SSE4.1 This optimisation is only applied when skip_restart is false. I’ve only tested the u16 codepath, as it is the one used in NieR. In some very unscientific profiling, this function used to take 2.76% of the total frame time at the save point of the port town, it now takes about 0.40%. * rsx: Mark all SSE4.1 functions with attributes on gcc and clang This assures the compiler we will take care of only calling these functions after having checked that the CPU does support these instructions. * rsx: Add an AVX2 implementation of primitive restart ibo upload * rsx: Remove redefinition of SSE4.1 instructions Now that clang is aware that our functions are compiled with SSE4.1, it lets us generate this code using its intrinsics. * rsx: Optimise vector to scalar conversion This is done using minpos and srli intrinsics and generate less code than before. Thanks Nekotekina for the suggestion!
This optimisation targets SSE4.1 and is only applied when
skip_restart
is false.I’ve only tested the
u16
codepath, as it is the one used in NieR.In some very unscientific profiling, this function used to take 2.76% of the total time at some spot of the port town, it now takes about 0.40% on SSE4.1, and 0.30% on AVX2.