some revamp on IQM and MD5 model code #389
Conversation
About the question on https://godbolt.org/z/sYsY6a (clang 6.0): basically I compared

```cpp
a[ 0 ] = b[ 0 ];
a[ 1 ] = b[ 1 ];
a[ 2 ] = b[ 2 ];
a[ 3 ] = b[ 3 ];
```

with:

```cpp
memcpy( a, b, 4 * sizeof( a[ 0 ] ));
```

And they both produce the same code:

```asm
mov esi, dword ptr [rsp]
mov edx, dword ptr [rsp + 4]
mov ecx, dword ptr [rsp + 8]
mov r8d, dword ptr [rsp + 12]
```

So I'll remove that memcpy commit. Note: the memcpy was used on an object with two components, and we don't have a macro for that.

Edit: well, the compiler is probably doing some optimization for the next function call… I don't know if that's meaningful.
About https://godbolt.org/z/7P6dnx (clang 6.0): I compared

```cpp
if ( a[ 0 ] == 0 && a[ 1 ] == 0 && a[ 2 ] == 0 && a[ 3 ] == 0 )
```

with:

```cpp
if ( memcmp( a, b, 4 * sizeof( b[ 0 ] ) ) )
```

and:

```cpp
if ( !( a[ 0 ] | a[ 1 ] | a[ 2 ] | a[ 3 ] ) )
```

And they produced:

```asm
mov eax, dword ptr [rsp]
or eax, dword ptr [rsp + 4]
or eax, dword ptr [rsp + 8]
or eax, dword ptr [rsp + 12]
jne .LBB0_2
```

and:

```asm
movdqa xmm0, xmmword ptr [rsp]
pcmpeqb xmm0, xmmword ptr [rsp + 16]
pmovmskb eax, xmm0
cmp eax, 65535
je .LBB1_2
```

and:

```asm
mov esi, dword ptr [rsp]
mov edx, dword ptr [rsp + 4]
mov eax, edx
or eax, esi
mov ecx, dword ptr [rsp + 8]
or eax, ecx
mov r8d, dword ptr [rsp + 12]
or eax, r8d
jne .LBB2_2
```

The third is probably way slower, but I don't know if there is something better about the first or second. Since the first and second implementations each take 4 instructions (not including the jump), they look comparable.

Interestingly, with clang 11, the first implementation compiles to the same code as the second:

```asm
movdqa xmm0, xmmword ptr [rsp]
pcmpeqb xmm0, xmmword ptr [rsp + 16]
pmovmskb eax, xmm0
cmp eax, 65535
je .LBB1_2
```
Hmm, for the “compare with zero” code (question about https://godbolt.org/z/bbxd3o, gcc 10.2), GCC produces:

```asm
mov eax, DWORD PTR [rsp]
or eax, DWORD PTR [rsp+4]
or eax, DWORD PTR [rsp+8]
or eax, DWORD PTR [rsp+12]
je .L5
```

and:

```asm
mov rdx, QWORD PTR [rsp+8]
mov rax, QWORD PTR [rsp]
xor rdx, QWORD PTR [rsp+24]
xor rax, QWORD PTR [rsp+16]
or rdx, rax
je .L6
```

and:

```asm
mov eax, DWORD PTR [rsp]
or eax, DWORD PTR [rsp+4]
or eax, DWORD PTR [rsp+8]
or eax, DWORD PTR [rsp+12]
je .L15
```

So with GCC, two of the variants compile to the same OR chain, while the remaining one is done with two 64-bit XORs.

For the “set all components” code (question about https://godbolt.org/z/dT9qG4, gcc 10.2):

```asm
mov rsi, QWORD PTR [rsp]
mov rcx, QWORD PTR [rsp+8]
```

Which looks very efficient, but then again it's probably some optimization for the next function call. But we know GCC optimizes less with that one.
_Force-pushed from `899dd3b` to `9eaeb61`._
I believe it works something like this: When the engine main thread calls an OpenGL function, the GL driver copies the data into a graphics operation queue. The GL driver runs all the time, in parallel with the engine thread, to perform any operations in the queue. If the queue is too full, then when the engine thread calls an OpenGL function, it may have to wait for some operations to be completed before there is room to insert the operations into the queue. I saw something similar while investigating Unvanquished/Unvanquished#1109. The GL driver will block the main thread for a while if the graphics processing pipeline gets too far behind. https://www.khronos.org/opengl/wiki/Synchronization#Asynchronous_action When the GPU is by far the bottleneck, you will get zero increase in overall throughput by speeding up the engine code.
Yes, that's what I'm afraid of. What I would like to understand is why the framerate doubles when I return right before that call.
Good question... maybe the garbage data is cheaper to render. Perhaps a fairer comparison would be to always take the slow path.
If I do this, the scene runs at a similar rate. Anyway, that's how it looks on such hardware.
OK, now I understand: the code in question only runs if `r_vboModels` is `0` or some capabilities are missing. The Tess_SurfaceIQM commit looks good - it should help to reduce pointer aliasing. With the Tess_SurfaceMD5 commit, it does not seem helpful to me to combine the two loops.
Yeah, those loops were some experiments. I plan to revert that on the md5mesh side too, and to implement freem's suggestions as well; it's just that at this point I have only properly redone the IQM code after my first attempt.
For the loop merge, at first I was looking for a way to reduce the number of loops, but the conditions may just destroy branch prediction, and anyway I did not notice any gain, so I'm just reverting them.
@slipher: I reproduced the pointer optimization on the md5mesh code, and moved tests outside of loops.
_Force-pushed from `3a31e66` to `bdb6aa9`._
I rebased on current master.

To remind the context: this work was motivated by Unvanquished/Unvanquished#1207, when I discovered that some GPUs may not be able to process the model transformations because of some limits, and the engine would then rely on a fallback that is too slow to make the game playable.

Most of this PR rewrites a lot of code to help the compiler optimize better, for example by reducing some pointer indirections and reducing branching (so less branch-prediction uncertainty). Other parts of this PR rewrite both the IQM and MD5 code to reduce the differences in their implementations as much as possible, in a way that makes it possible to port optimizations back and forth, or to identify differences that may explain performance differences (maybe in the future it would be possible to use the same code for both? who knows). It's believed there is still a lot of room for further optimization.
But this PR is already very big, and I would like to see it merged before I look at other optimizations; I basically need it merged so I can count on it.

To test this code, one can set `r_vboVertexSkinning` to `0`.

While this does not bring the twofold performance increase I was looking for, the gain is still measurable. On the plat23 map, comparing the same scene before and after:
So that's already a 20~25% gain. I remember that on a very old computer the fps went from 10 to 12 on that complex scene (while the same old computer can do 40 fps when there are none of such non-hardware-accelerated models), which confirms the 20% gain. I'm still dreaming of doubling the performance though…
Note: given the comment there: #389 (comment), the MD5 animation is now also a bit faster when done on the CPU.
Also, a variable was used before initialization.
lsalzman said:

> A 0 weight has no influence on the joint, because the influence is multiplied by 0.

This is not perfect, as it is like saving two fps when we could multiply fps by two instead. The best would be to precompute things at load time in order to prevent branching within loops. See DaemonEngine#390.
@slipher said:

```cpp
TransInit(bone);
TransAddScale(entityScale, bone);
```

is equivalent to

```cpp
TransInitScale(entityScale, bone);
```
I tested on an old Intel integrated GPU (Core 2 Duo L7500 + X3100 / GMA965). On this hardware, the Linux driver emulates GPU vertex animation, which is the slowest implementation: we would get only 8 fps on the heavy plat23 alien base reference scene.

With our own CPU alternative code, it was already faster. With this IQM revamp, 15 fps is reached. With this IQM revamp plus the q_math inline (#505), 16 fps is reached.

After this PR and #505, there are two more optimization paths to explore (both can be done).

Note: to test this code path, one can set `r_vboVertexSkinning` to `0`.
The PR is assumed to be ready to merge. The proposed changes are meant to improve performance without changing the underlying logic; except for some computation being moved from run time to load time, there was no intention to rethink the way it is done. The changes are now thoroughly tested on various hardware (Edit: and drivers, including the Nvidia proprietary one) and with the compilers we use for releases (GCC and Clang). I only tested on Linux, though I doubt there would be any operating-system specificity there. Review comments were addressed and suggestions applied. Even if not everything got a comment, at this point I don't even know what can be left for a review. I see no reason to delay the merge further, so if no one has something to say I'll merge it tomorrow.
People wanting to use that code can set a cvar for it. In the past I mentioned another cvar; the exact cvar to enable the CPU path for model animation is `r_vboVertexSkinning` set to `0`.
I was looking to see if there was a way to reduce, engine-side, the impact of some models slowing down the game a lot on some older hardware (see Unvanquished/Unvanquished#1207), so I incrementally made minor change after minor change to keep the code easy to read, understandable, and refactorable, and in the end this is what I got.
Now I know why the performance drops so much, and no improvement to this code can do magic. Anyway, this code now looks better and may even save some cycles (or rely less on compiler optimization). Also, two loops were rewritten into one, so the code only iterates once for both.
I ported these changes to the MD5 model code, which in return gave me some examples to clean up the IQM code a bit more.
There are two changes I have committed separately (see the commits about `memcpy` and `memcmp`); they are two alternative implementations of a given code. Those two commits are there for a question: is there something to gain from them? The idea behind those alternate implementations is that `memcmp` and `memcpy` may have specifically optimized code in the standard library. I would like to know whether they are better or not, so I can squash or drop them.

Also, I made the `r_vboModels` cvar a latched one, because an engine restart is required to see a change.

Basically, people can set `r_vboVertexSkinning` to `0` to reproduce Unvanquished/Unvanquished#1207 on any hardware. The issue is that some Unvanquished models have too many bones for the hardware, so a fast OpenGL feature cannot be used even though the feature is supported.