
some revamp on IQM and MD5 model code #389

Merged: 11 commits merged into DaemonEngine:master on Jul 18, 2021

Conversation

@illwieckz (Member) commented on Oct 20, 2020

I was looking for a way to reduce, on the engine side, the impact of some models that slow the game down a lot on older hardware (see Unvanquished/Unvanquished#1207). I incrementally made minor change after minor change to make the code easier to read, understand, and refactor, and this is what I ended up with.

Now I know why the performance drops so much, and no improvement to this code can do magic. Anyway, the code now looks better and may even save some cycles (or rely less on compiler optimization). Also, two loops were rewritten into one, so the code only iterates once for both.

I ported these changes to the MD5 model code, which in turn gave me some examples for cleaning up the IQM code a bit more.

There are two changes I committed separately (see the commits about memcpy and memcmp); they are alternative implementations of a given piece of code, and those two commits are there to ask a question: is there something to gain from them? The idea behind these alternate implementations is that memcmp and memcpy may have specially optimized code in the standard library. I would like to know whether they are better or not, so I can either squash or drop them.

Also, I made the r_vboModels cvar a latched one, because an engine restart is required for a change to take effect.

Basically, people can set r_vboVertexSkinning to 0 to reproduce Unvanquished/Unvanquished#1207 on any hardware.

The issue is that some Unvanquished models have too many bones for the hardware, so a fast OpenGL path cannot be used even though the feature itself is supported.
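For context, the choice between the GPU skinning path and the CPU fallback comes down to this kind of capability check; a minimal sketch, assuming GLEW as the loader, with an illustrative per-bone cost and reserve rather than the engine's actual numbers:

    #include <GL/glew.h>

    // Sketch: decide whether GPU vertex skinning can be used for a model.
    // The limit query is standard OpenGL; the per-bone cost is illustrative.
    bool CanUseGpuSkinning( int numBones )
    {
        GLint maxUniformComponents = 0;
        glGetIntegerv( GL_MAX_VERTEX_UNIFORM_COMPONENTS, &maxUniformComponents );

        // Each bone transform consumes a handful of uniform components
        // (e.g. a 3x4 matrix = 12 floats); reserve some for other uniforms.
        const int componentsPerBone = 12;
        const int reservedComponents = 256;

        return numBones * componentsPerBone + reservedComponents <= maxUniformComponents;
    }

When a model exceeds the limit, the engine has to fall back to CPU skinning, which is the slow path discussed in this PR.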

@illwieckz added the T-Improvement, A-Renderer, T-Cleanup and T-Question labels on Oct 20, 2020
@illwieckz (Member, Author) commented on Oct 21, 2020

About the question of memcpy versus a per-component copy, it looks like the compiler produces the exact same code:

https://godbolt.org/z/sYsY6a (clang 6.0)

Basically, I compared:

    a[ 0 ] = b[ 0 ];
    a[ 1 ] = b[ 1 ];
    a[ 2 ] = b[ 2 ];
    a[ 3 ] = b[ 3 ];

with:

    memcpy( a, b, 4 * sizeof( a[ 0 ] ));

And they both produce the same code:

  mov esi, dword ptr [rsp]
  mov edx, dword ptr [rsp + 4]
  mov ecx, dword ptr [rsp + 8]
  mov r8d, dword ptr [rsp + 12]

So I'll remove that memcpy commit.

Note: the memcpy was used on an object with two components; we don't have a macro like VectorCopy for that yet.
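For illustration, such a two-component helper could look like the sketch below; Vector2Copy is a hypothetical name chosen by analogy with the existing VectorCopy macro:

    // Hypothetical two-component analogue of VectorCopy, written in the same
    // style as the existing per-component vector macros: a plain copy of a into b.
    #define Vector2Copy( a, b ) ( ( b )[ 0 ] = ( a )[ 0 ], ( b )[ 1 ] = ( a )[ 1 ] )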

Edit: well, the compiler is probably doing some optimization for the next function call… I don't know if that's meaningful.

@illwieckz (Member, Author) commented on Oct 21, 2020

About the memcmp approach for checking that something is zeroed, I tested:

https://godbolt.org/z/7P6dnx (clang 6.0)

    if ( a[ 0 ] == 0 && a[ 1 ] == 0 && a[ 2 ] == 0 && a[ 3 ] == 0 )

with:

    if ( memcmp( a, b, 4 * sizeof( b[ 0 ] ) ) )

and:

    if ( !( a [ 0 ] | a[ 1 ] | a [ 2 ] | a [ 3 ] ) )

And they produced:

  mov eax, dword ptr [rsp]
  or eax, dword ptr [rsp + 4]
  or eax, dword ptr [rsp + 8]
  or eax, dword ptr [rsp + 12]
  jne .LBB0_2

and:

  movdqa xmm0, xmmword ptr [rsp]
  pcmpeqb xmm0, xmmword ptr [rsp + 16]
  pmovmskb eax, xmm0
  cmp eax, 65535
  je .LBB1_2

and:

  mov esi, dword ptr [rsp]
  mov edx, dword ptr [rsp + 4]
  mov eax, edx
  or eax, esi
  mov ecx, dword ptr [rsp + 8]
  or eax, ecx
  mov r8d, dword ptr [rsp + 12]
  or eax, r8d
  jne .LBB2_2

The third is probably way slower, but I don't know whether there is anything to choose between the first and the second. Since the first and second implementations both take 4 instructions (not counting the jump), it looks like the memcmp implementation only starts to become worthwhile at 4 components or more.

Interestingly, with clang 11, the first implementation is compiled the memcmp way:

https://godbolt.org/z/G7xWoz

  movdqa xmm0, xmmword ptr [rsp]
  pcmpeqb xmm0, xmmword ptr [rsp + 16]
  pmovmskb eax, xmm0
  cmp eax, 65535
  je .LBB1_2

Edit: the xmm0 register is an SSE thing; I assume we should favor C/C++ code that makes the compiler produce SSE code as much as we can.
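For reference, here is a self-contained version of the three zero checks compared above, close to what was fed to Compiler Explorer (the int element type and the function names are assumptions made for this sketch):

    #include <cstring>

    // Three ways to test that a 4-component array is all zero.
    bool IsZeroAnd( const int a[ 4 ] )
    {
        return a[ 0 ] == 0 && a[ 1 ] == 0 && a[ 2 ] == 0 && a[ 3 ] == 0;
    }

    bool IsZeroMemcmp( const int a[ 4 ] )
    {
        static const int zero[ 4 ] = { 0, 0, 0, 0 };
        return memcmp( a, zero, sizeof( zero ) ) == 0;
    }

    bool IsZeroOr( const int a[ 4 ] )
    {
        return !( a[ 0 ] | a[ 1 ] | a[ 2 ] | a[ 3 ] );
    }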

@illwieckz (Member, Author) commented on Oct 21, 2020

Hmm, for the “compare with zero” code (the memcmp question), GCC seems to produce better code (Edit: but not an SSE one):

https://godbolt.org/z/bbxd3o (gcc 10.2)

&& way:

  mov eax, DWORD PTR [rsp]
  or eax, DWORD PTR [rsp+4]
  or eax, DWORD PTR [rsp+8]
  or eax, DWORD PTR [rsp+12]
  je .L5

memcmp way:

  mov rdx, QWORD PTR [rsp+8]
  mov rax, QWORD PTR [rsp]
  xor rdx, QWORD PTR [rsp+24]
  xor rax, QWORD PTR [rsp+16]
  or rdx, rax
  je .L6

| way:

  mov eax, DWORD PTR [rsp]
  or eax, DWORD PTR [rsp+4]
  or eax, DWORD PTR [rsp+8]
  or eax, DWORD PTR [rsp+12]
  je .L15

So with GCC, the && way and the | way produce the same code, while the memcmp code uses one more instruction and one more register. So it's better to use the && way; it is also more consistent across compilers.

For the “set all components” code (the memcpy question), GCC unfortunately optimizes so aggressively that the first implementation in my sample code is not emitted at all, but for the second implementation (the memcpy one), GCC produces this:

https://godbolt.org/z/dT9qG4 (gcc 10.2)

  mov rsi, QWORD PTR [rsp]
  mov rcx, QWORD PTR [rsp+8]

Which looks very efficient, but then again it's probably some optimization for the next function call. Still, we know GCC optimizes the memcpy version less aggressively.

@slipher (Member) commented on Oct 22, 2020

I believe it works something like this:

When the engine main thread calls an OpenGL function, the GL driver copies the data into a graphics operation queue. The GL driver runs all the time, in parallel with the engine thread, to perform any operations in the queue. If the queue is too full, then when the engine thread calls an OpenGL function, it may have to wait for some operations to be completed before there is room to insert the operations into the queue. I saw something similar while investigating Unvanquished/Unvanquished#1109. The GL driver will block the main thread for a while if the graphics processing pipeline gets too far behind.

https://www.khronos.org/opengl/wiki/Synchronization#Asynchronous_action

When the GPU is by far the bottleneck, you will get zero increase in overall throughput by speeding up the engine code.
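As an illustration of that queueing behaviour, one can place a fence after a frame's GL commands and measure how long the driver blocks the CPU while the GPU catches up; a minimal sketch, assuming an OpenGL 3.2+ context and GLEW (or any loader exposing the sync API):

    #include <GL/glew.h>
    #include <chrono>
    #include <cstdio>

    // A long wait here means the GPU pipeline, not the engine code, is the
    // bottleneck: the driver will not accept much more queued work.
    void MeasureGpuLag()
    {
        GLsync fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );

        auto start = std::chrono::steady_clock::now();
        // Block for up to 100 ms until the GPU has executed everything queued so far.
        glClientWaitSync( fence, GL_SYNC_FLUSH_COMMANDS_BIT, 100 * 1000 * 1000 );
        auto end = std::chrono::steady_clock::now();

        glDeleteSync( fence );

        long long us = std::chrono::duration_cast<std::chrono::microseconds>( end - start ).count();
        std::printf( "waited %lld us for the GPU\n", us );
    }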

@illwieckz (Member, Author) commented:

Yes, that's what I'm afraid of. What I would like to understand is why the framerate doubles when I return right before the // Deform the vertices by the lerped bones. comment in the IQM code. At that point the data is garbage, so surfaces are badly rendered, but this doubles the framerate…

@slipher (Member) commented on Oct 22, 2020

> What I would like to understand is why the framerate doubles when I return right before the // Deform the vertices by the lerped bones. comment in the IQM code.

Good question... maybe the garbage data in tess.verts results in points outside the player's field of view or something, which allows them to be thrown out without further processing?

Perhaps a fairer comparison would be to always take the else clause corresponding to if( model->num_joints > 0 && model->blendWeights && model->blendIndexes ), rather than returning if the condition is true.

@illwieckz (Member, Author) commented on Oct 22, 2020

> Perhaps a fairer comparison would be to always take the else clause corresponding to if( model->num_joints > 0 && model->blendWeights && model->blendIndexes ), rather than returning if the condition is true.

If I do this, the scene runs at 12 fps instead of 7 fps (this GPU can do 70 fps on this map when there is no model on screen).

Anyway, that's how it looks on such hardware:

[screenshot: slow models]

@slipher (Member) commented on Oct 23, 2020

OK, now I understand: the code in question only runs if r_vboModels is 0 or some capabilities are missing.

The Tess_SurfaceIQM commit looks good - it should help to reduce pointer aliasing.

With the Tess_SurfaceMD5 commit, it does not seem helpful to me to combine the srf->numTriangles- and srf->numVerts-bounded loops into one. Why do you want to do that?

@illwieckz (Member, Author) commented on Oct 23, 2020

Yeah, those loops were experiments. I plan to revert that on the md5mesh side too, and to implement freem's suggestions as well; at this point I have only properly redone the IQM code after my first attempt.

@illwieckz (Member, Author) commented:

For the loop merge, at first I was looking for a way to reduce the number of loops, but the conditions may just destroy branch prediction, and anyway I did not notice any gain, so I'm just reverting them.

@illwieckz (Member, Author) commented:

@slipher: I reproduced the pointer optimization on the md5mesh code, and moved the tests outside of the loops.
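The “tests moved outside of the loops” part is the classic loop-unswitching pattern; a minimal sketch with made-up helper names, not the actual MD5 code:

    // Illustrative helpers standing in for the real per-vertex copy code.
    void CopyVertex( int i );
    void CopyVertexWithNormal( int i );

    // Before: a loop-invariant condition is re-evaluated for every vertex.
    void DeformBefore( int numVerts, bool hasNormals )
    {
        for ( int i = 0; i < numVerts; i++ )
        {
            if ( hasNormals )
            {
                CopyVertexWithNormal( i );
            }
            else
            {
                CopyVertex( i );
            }
        }
    }

    // After: the test is hoisted out, and each loop body is branch-free.
    void DeformAfter( int numVerts, bool hasNormals )
    {
        if ( hasNormals )
        {
            for ( int i = 0; i < numVerts; i++ )
            {
                CopyVertexWithNormal( i );
            }
        }
        else
        {
            for ( int i = 0; i < numVerts; i++ )
            {
                CopyVertex( i );
            }
        }
    }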

@illwieckz changed the title from “some revamp on IQM and MD5 model code” to “WIP: some revamp on IQM and MD5 model code” on May 31, 2021
@illwieckz marked this pull request as a draft on May 31, 2021 22:05
@illwieckz changed the base branch from 0.52.0/sync to master on May 31, 2021 22:41
@illwieckz (Member, Author) commented on Jun 29, 2021

I rebased on current master. As a reminder of the context, this work was motivated by Unvanquished/Unvanquished#1207, when I discovered that some GPUs may not be able to process the model transformations because of some limits, in which case the engine falls back to a path that is too slow to keep the game playable.

Most of this PR rewrites a lot of code to help the compiler optimize it better, for example by reducing pointer indirections and branching (so less branch prediction uncertainty). The other part of this PR rewrites both the IQM and MD5 code to minimize the differences between their implementations, making it possible to port optimizations back and forth and to identify differences that may explain performance gaps (maybe in the future it would be possible to use the same code for both? who knows).
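A sketch of the pointer-indirection side of those rewrites; the struct layout below is illustrative, not the engine's actual tess/surf definitions:

    // Illustrative types standing in for the engine's tess and surface structures.
    struct vertex_t { float xyz[ 3 ]; };
    struct surface_t { int numVerts; vertex_t *verts; };
    struct tess_t { int numVertexes; vertex_t verts[ 4096 ]; };

    // Before: every iteration re-reads tess.numVertexes and goes through two
    // indirections, which the compiler cannot always hoist because of aliasing.
    void CopyVertsBefore( tess_t &tess, const surface_t *surf )
    {
        for ( int i = 0; i < surf->numVerts; i++ )
        {
            tess.verts[ tess.numVertexes + i ].xyz[ 0 ] = surf->verts[ i ].xyz[ 0 ];
            tess.verts[ tess.numVertexes + i ].xyz[ 1 ] = surf->verts[ i ].xyz[ 1 ];
            tess.verts[ tess.numVertexes + i ].xyz[ 2 ] = surf->verts[ i ].xyz[ 2 ];
        }
    }

    // After: the invariant address computation is hoisted into plain local
    // pointers, so the loop body only advances two pointers.
    void CopyVertsAfter( tess_t &tess, const surface_t *surf )
    {
        vertex_t *dstVert = &tess.verts[ tess.numVertexes ];
        const vertex_t *srcVert = surf->verts;

        for ( int i = 0; i < surf->numVerts; i++, dstVert++, srcVert++ )
        {
            dstVert->xyz[ 0 ] = srcVert->xyz[ 0 ];
            dstVert->xyz[ 1 ] = srcVert->xyz[ 1 ];
            dstVert->xyz[ 2 ] = srcVert->xyz[ 2 ];
        }
    }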

It's believed there is still plenty of room for optimization, given that:

  • one scene with MD5 acid tubes instead of IQM acid tubes is known to run twice as fast on some old hardware,
  • ZTM has done heavy optimizations in ioquake3 involving aggressive precomputation, and we may want to import those one day.

But this PR is already very big, and I would like to see it merged before I look at other optimizations. I basically need this to be merged so I can build on top of it.

To test this code, the r_vboModels cvar must be set to 0. Note: when r_vboModels is set to 0 there is a bug where third-person MD3 weapon models are not displayed; that is unrelated to this code, and when the non-VBO model animation path is taken because of a hardware limit (not because that cvar is disabled), they are displayed properly.

While this does not bring the twofold performance increase I was looking for, the gain is still measurable:

On the plat23 map, using the lowest preset and r_vboModels 0, I get the following on my desktop computer:

Scene: /setviewpos 2314 2124 308 52 45 (camera on alien overmind, two eggs, leech and barricade)

  • Before: 59 fps
  • After: 74 fps

Scene: /setviewpos 1893 1920 -5 0 0 (camera from alien base entry corridor, even if not everything is seen, the engine renders 6 acid tubes, 3 eggs, 1 overmind, 1 leech and 1 barricade)

  • Before: 28 fps
  • After: 38 fps

So that's already a 20~25% gain.

I remember that on a very old computer the fps went from 10 to 12 on that complex scene (while the same computer can do 40 fps when none of these non-hardware-accelerated models are present), which confirms the 20% gain. I'm still dreaming of doubling the performance, though…

@illwieckz changed the title from “WIP: some revamp on IQM and MD5 model code” to “some revamp on IQM and MD5 model code” on Jun 29, 2021
@illwieckz marked this pull request as ready for review on Jun 29, 2021 04:15
@illwieckz (Member, Author) commented on Jun 29, 2021

Note: given the comment there (#389 (comment)), the MD5 animation is now also a bit faster when done on the CPU.

Also, a variable was used before initialization.

lsalzman said:

> A 0 weight has no influence on the joint, because the influence is multiplied by 0.

This is not perfect, as it is like saving two fps when we could multiply fps by two instead. The best would be to precompute things at load time in order to prevent branching within loops. See DaemonEngine#390.

@slipher said:

    TransInit(bone); TransAddScale(entityScale, bone);

is equivalent to

    TransInitScale(entityScale, bone);
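A sketch of the precompute-at-load-time idea: drop zero weights when the model is loaded so the per-frame skinning loop never has to branch on them. The data structure and function below are hypothetical, not what DaemonEngine#390 proposes verbatim:

    #include <vector>

    // Hypothetical compacted influence, built once at load time.
    struct boneInfluence_t
    {
        int   boneIndex;
        float weight;   // non-zero by construction
    };

    // Load time: keep only the influences that actually contribute, since a
    // 0 weight has no influence on the joint.
    std::vector<boneInfluence_t> CompactInfluences(
        const int boneIndexes[ 4 ], const float weights[ 4 ] )
    {
        std::vector<boneInfluence_t> influences;

        for ( int j = 0; j < 4; j++ )
        {
            if ( weights[ j ] != 0.0f )
            {
                influences.push_back( { boneIndexes[ j ], weights[ j ] } );
            }
        }

        return influences;
    }

At run time, the skinning loop would then only iterate over the compacted list, with no weight == 0 test left inside the hot loop.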
@illwieckz (Member, Author) commented on Jul 11, 2021

I tested on an old Intel integrated GPU (Core 2 Duo L7500 + X3100 / GMA965). On this hardware, the Linux driver emulates the GPU vertex animation, which is the slowest implementation: it gets only 8 fps on the heavy plat23 alien base reference scene:

[screenshot: iqm slow on cpu]

With our own CPU alternative code (r_vboVertexSkinning 0), current master gets 13 fps:

[screenshot: iqm slow on cpu]

With this IQM revamp, 15 fps is reached:

[screenshot: iqm slow on cpu]

With this IQM revamp plus q_math inline (#505), 16 fps is reached:

[screenshot: iqm slow on cpu]

This scene (+devmap plat23 +delay 100f setviewpos 1893 1920 0 0 0) is very heavy to render; there are twelve animated buildables to process: one overmind, one leech, one barricade, six acid tubes, and three eggs.

After this PR and #505, there are two more optimization paths to explore (both can be done): turning the q_math inline functions into SSE code, and porting ZTM's optimizations from ioquake3 as described in #390 (comment).
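For the first path, a minimal sketch of what an SSE version of a q_math-style helper could look like (the function name and 4-component signature are assumptions; the actual q_math helpers mostly work on 3 components):

    #include <xmmintrin.h>

    // Sketch: c = a + s * b over 4 floats at once, the kind of multiply-add
    // the vector helpers currently do one component at a time.
    static inline void Vector4MA_SSE( const float *a, float s, const float *b, float *c )
    {
        __m128 va = _mm_loadu_ps( a );
        __m128 vb = _mm_loadu_ps( b );
        __m128 vs = _mm_set1_ps( s );

        _mm_storeu_ps( c, _mm_add_ps( va, _mm_mul_ps( vs, vb ) ) );
    }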

Note: to test this code path, one can set r_vboVertexSkinning to 0.

@illwieckz (Member, Author) commented on Jul 12, 2021

The PR is now considered ready to merge.

The proposed changes are meant to improve performance without changing the underlying logic. Apart from some computation being moved from run time to load time, there was no intention to rethink the way things are done.

The proposed changes have now been thoroughly tested on various hardware (Edit: and drivers, including the Nvidia proprietary one) and with the compilers we use for releases (GCC and Clang). I only tested on Linux, though I doubt there is anything operating-system specific here.

Review comments were addressed and suggestions applied.

Even if not everything got a comment, at this point I don't even know what would be left to review.

I see no reason to delay the merge further, so if no one has anything to say I'll merge it tomorrow.

@illwieckz merged commit 0cd380a into DaemonEngine:master on Jul 18, 2021
@illwieckz (Member, Author) commented:

People wanting to exercise this code path can set r_vboVertexSkinning to 0.

In the past I talked about r_vboModels 0, but that cvar also has unwanted effects on other things.

The exact cvar to enable the CPU path for model animation is r_vboVertexSkinning 0.
