
some revamp on IQM and MD5 model code #389

Merged: 11 commits merged into DaemonEngine:master on Jul 18, 2021

Conversation

@illwieckz (Member) commented on Oct 20, 2020

I was looking for a way to reduce, on the engine side, the impact of some models that slow the game down a lot on older hardware (see Unvanquished/Unvanquished#1207). I incrementally made minor change after minor change to make the code easier to read, understand, and refactor, and this is what I ended up with.

Now I know why the performance drops so much, and no improvement to this code can do magic. Anyway, the code now looks better and may even save some cycles (or rely less on compiler optimization). Also, two loops were rewritten into one, so the code only iterates once for both.

I ported these changes to the MD5 model code, which in turn gave me some examples for cleaning up the IQM code a bit more.

There are two changes I committed separately (see the commits about memcpy and memcmp); they are alternative implementations of a given piece of code, and those two commits are there to ask a question: is there something to gain from them? The idea behind these alternate implementations is that memcmp and memcpy may have specially optimized code in the standard library. I would like to know whether they are better or not, so I can either squash or drop them.

Also, I made the r_vboModels cvar a latched one, because an engine restart is required for a change to take effect.

Basically, people can set r_vboVertexSkinning to 0 to reproduce Unvanquished/Unvanquished#1207 on any hardware.

The issue is that some Unvanquished models have too many bones for the hardware, so a fast OpenGL path cannot be used even though the feature itself is supported.
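For context, the choice between the GPU skinning path and the CPU fallback comes down to this kind of capability check; a minimal sketch, assuming GLEW as the loader, with an illustrative per-bone cost and reserve rather than the engine's actual numbers:

    #include <GL/glew.h>

    // Sketch: decide whether GPU vertex skinning can be used for a model.
    // The limit query is standard OpenGL; the per-bone cost is illustrative.
    bool CanUseGpuSkinning( int numBones )
    {
        GLint maxUniformComponents = 0;
        glGetIntegerv( GL_MAX_VERTEX_UNIFORM_COMPONENTS, &maxUniformComponents );

        // Each bone transform consumes a handful of uniform components
        // (e.g. a 3x4 matrix = 12 floats); reserve some for other uniforms.
        const int componentsPerBone = 12;
        const int reservedComponents = 256;

        return numBones * componentsPerBone + reservedComponents <= maxUniformComponents;
    }

When a model exceeds the limit, the engine has to fall back to CPU skinning, which is the slow path discussed in this PR.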

@illwieckz added the T-Improvement, A-Renderer, T-Cleanup and T-Question labels on Oct 20, 2020
@illwieckz (Member, Author) commented on Oct 21, 2020

About the question of memcpy versus a per-component copy, it looks like the compiler produces the exact same code:

https://godbolt.org/z/sYsY6a (clang 6.0)

Basically, I compared:

    a[ 0 ] = b[ 0 ];
    a[ 1 ] = b[ 1 ];
    a[ 2 ] = b[ 2 ];
    a[ 3 ] = b[ 3 ];

with:

    memcpy( a, b, 4 * sizeof( a[ 0 ] ));

And they both produce the same code:

  mov esi, dword ptr [rsp]
  mov edx, dword ptr [rsp + 4]
  mov ecx, dword ptr [rsp + 8]
  mov r8d, dword ptr [rsp + 12]

So I'll remove that memcpy commit.

Note: the memcpy was used on an object with two components; we don't have a macro like VectorCopy for that yet.
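For illustration, such a two-component helper could look like the sketch below; Vector2Copy is a hypothetical name chosen by analogy with the existing VectorCopy macro:

    // Hypothetical two-component analogue of VectorCopy, written in the same
    // style as the existing per-component vector macros: a plain copy of a into b.
    #define Vector2Copy( a, b ) ( ( b )[ 0 ] = ( a )[ 0 ], ( b )[ 1 ] = ( a )[ 1 ] )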

Edit: well, the compiler is probably doing some optimization for the next function call… I don't know if that's meaningful.

@illwieckz (Member, Author) commented on Oct 21, 2020

About the memcmp approach for checking that something is zeroed, I tested:

https://godbolt.org/z/7P6dnx (clang 6.0)

    if ( a[ 0 ] == 0 && a[ 1 ] == 0 && a[ 2 ] == 0 && a[ 3 ] == 0 )

with:

    if ( memcmp( a, b, 4 * sizeof( b[ 0 ] ) ) )

and:

    if ( !( a [ 0 ] | a[ 1 ] | a [ 2 ] | a [ 3 ] ) )

And they produced:

  mov eax, dword ptr [rsp]
  or eax, dword ptr [rsp + 4]
  or eax, dword ptr [rsp + 8]
  or eax, dword ptr [rsp + 12]
  jne .LBB0_2

and:

  movdqa xmm0, xmmword ptr [rsp]
  pcmpeqb xmm0, xmmword ptr [rsp + 16]
  pmovmskb eax, xmm0
  cmp eax, 65535
  je .LBB1_2

and:

  mov esi, dword ptr [rsp]
  mov edx, dword ptr [rsp + 4]
  mov eax, edx
  or eax, esi
  mov ecx, dword ptr [rsp + 8]
  or eax, ecx
  mov r8d, dword ptr [rsp + 12]
  or eax, r8d
  jne .LBB2_2

The third is probably way slower, but I don't know whether there is anything to choose between the first and the second. Since the first and second implementations both take 4 instructions (not counting the jump), it looks like the memcmp implementation only starts to become worthwhile at 4 components or more.

Interestingly, with clang 11, the first implementation is compiled the memcmp way:

https://godbolt.org/z/G7xWoz

  movdqa xmm0, xmmword ptr [rsp]
  pcmpeqb xmm0, xmmword ptr [rsp + 16]
  pmovmskb eax, xmm0
  cmp eax, 65535
  je .LBB1_2

Edit: the xmm0 register is an SSE thing; I assume we should favor C/C++ code that makes the compiler produce SSE code as much as we can.
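For reference, here is a self-contained version of the three zero checks compared above, close to what was fed to Compiler Explorer (the int element type and the function names are assumptions made for this sketch):

    #include <cstring>

    // Three ways to test that a 4-component array is all zero.
    bool IsZeroAnd( const int a[ 4 ] )
    {
        return a[ 0 ] == 0 && a[ 1 ] == 0 && a[ 2 ] == 0 && a[ 3 ] == 0;
    }

    bool IsZeroMemcmp( const int a[ 4 ] )
    {
        static const int zero[ 4 ] = { 0, 0, 0, 0 };
        return memcmp( a, zero, sizeof( zero ) ) == 0;
    }

    bool IsZeroOr( const int a[ 4 ] )
    {
        return !( a[ 0 ] | a[ 1 ] | a[ 2 ] | a[ 3 ] );
    }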

@illwieckz (Member, Author) commented on Oct 21, 2020

Hmm, for the “compare with zero” code (the memcmp question), GCC seems to produce better code (Edit: but not an SSE one):

https://godbolt.org/z/bbxd3o (gcc 10.2)

&& way:

  mov eax, DWORD PTR [rsp]
  or eax, DWORD PTR [rsp+4]
  or eax, DWORD PTR [rsp+8]
  or eax, DWORD PTR [rsp+12]
  je .L5

memcmp way:

  mov rdx, QWORD PTR [rsp+8]
  mov rax, QWORD PTR [rsp]
  xor rdx, QWORD PTR [rsp+24]
  xor rax, QWORD PTR [rsp+16]
  or rdx, rax
  je .L6

| way:

  mov eax, DWORD PTR [rsp]
  or eax, DWORD PTR [rsp+4]
  or eax, DWORD PTR [rsp+8]
  or eax, DWORD PTR [rsp+12]
  je .L15

So with GCC, the && way and the | way produce the same code, while the memcmp code uses one more instruction and one more register. So it's better to use the && way; it is also more consistent across compilers.

For the “set all components” code (the memcpy question), GCC unfortunately optimizes so aggressively that the first implementation in my sample code is not emitted at all, but for the second implementation (the memcpy one), GCC produces this:

https://godbolt.org/z/dT9qG4 (gcc 10.2)

  mov rsi, QWORD PTR [rsp]
  mov rcx, QWORD PTR [rsp+8]

Which looks very efficient, but then again it's probably some optimization for the next function call. Still, we know GCC optimizes the memcpy version less aggressively.

@slipher (Member) commented on Oct 22, 2020

I believe it works something like this:

When the engine main thread calls an OpenGL function, the GL driver copies the data into a graphics operation queue. The GL driver runs all the time, in parallel with the engine thread, to perform any operations in the queue. If the queue is too full, then when the engine thread calls an OpenGL function, it may have to wait for some operations to be completed before there is room to insert the operations into the queue. I saw something similar while investigating Unvanquished/Unvanquished#1109. The GL driver will block the main thread for a while if the graphics processing pipeline gets too far behind.

https://www.khronos.org/opengl/wiki/Synchronization#Asynchronous_action

When the GPU is by far the bottleneck, you will get zero increase in overall throughput by speeding up the engine code.
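As an illustration of that queueing behaviour, one can place a fence after a frame's GL commands and measure how long the driver blocks the CPU while the GPU catches up; a minimal sketch, assuming an OpenGL 3.2+ context and GLEW (or any loader exposing the sync API):

    #include <GL/glew.h>
    #include <chrono>
    #include <cstdio>

    // A long wait here means the GPU pipeline, not the engine code, is the
    // bottleneck: the driver will not accept much more queued work.
    void MeasureGpuLag()
    {
        GLsync fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );

        auto start = std::chrono::steady_clock::now();
        // Block for up to 100 ms until the GPU has executed everything queued so far.
        glClientWaitSync( fence, GL_SYNC_FLUSH_COMMANDS_BIT, 100 * 1000 * 1000 );
        auto end = std::chrono::steady_clock::now();

        glDeleteSync( fence );

        long long us = std::chrono::duration_cast<std::chrono::microseconds>( end - start ).count();
        std::printf( "waited %lld us for the GPU\n", us );
    }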

@illwieckz (Member, Author) commented:

Yes, that's what I'm afraid of. What I would like to understand is why the framerate doubles when I return right before the // Deform the vertices by the lerped bones. comment in the IQM code. At that point the data is garbage, so surfaces are badly rendered, but this doubles the framerate…

@slipher (Member) commented on Oct 22, 2020

> What I would like to understand is why the framerate doubles when I return right before the // Deform the vertices by the lerped bones. comment in the IQM code.

Good question... maybe the garbage data in tess.verts results in points outside the player's field of view or something, which allows them to be thrown out without further processing?

Perhaps a fairer comparison would be to always take the else clause corresponding to if( model->num_joints > 0 && model->blendWeights && model->blendIndexes ), rather than returning if the condition is true.

@illwieckz (Member, Author) commented on Oct 22, 2020

> Perhaps a fairer comparison would be to always take the else clause corresponding to if( model->num_joints > 0 && model->blendWeights && model->blendIndexes ), rather than returning if the condition is true.

If I do this, the scene runs at 12 fps instead of 7 fps (this GPU can do 70 fps on this map when there is no model on screen).

Anyway, that's how it looks on such hardware:

[screenshot: slow models]

@slipher (Member) commented on Oct 23, 2020

OK, now I understand: the code in question only runs if r_vboModels is 0 or some capabilities are missing.

The Tess_SurfaceIQM commit looks good - it should help to reduce pointer aliasing.

With the Tess_SurfaceMD5 commit, it does not seem helpful to me to combine the srf->numTriangles- and srf->numVerts-bounded loops into one. Why do you want to do that?

@illwieckz (Member, Author) commented on Oct 23, 2020

Yeah, those loops were experiments. I plan to revert that on the md5mesh side too, and to implement freem's suggestions as well; at this point I have only properly redone the IQM code after my first attempt.

@illwieckz (Member, Author) commented:

For the loop merge, at first I was looking for a way to reduce the number of loops, but the conditions may just destroy branch prediction, and anyway I did not notice any gain, so I'm just reverting them.

@illwieckz (Member, Author) commented:

@slipher: I reproduced the pointer optimization on the md5mesh code, and moved the tests outside of the loops.
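The “tests moved outside of the loops” part is the classic loop-unswitching pattern; a minimal sketch with made-up helper names, not the actual MD5 code:

    // Illustrative helpers standing in for the real per-vertex copy code.
    void CopyVertex( int i );
    void CopyVertexWithNormal( int i );

    // Before: a loop-invariant condition is re-evaluated for every vertex.
    void DeformBefore( int numVerts, bool hasNormals )
    {
        for ( int i = 0; i < numVerts; i++ )
        {
            if ( hasNormals )
            {
                CopyVertexWithNormal( i );
            }
            else
            {
                CopyVertex( i );
            }
        }
    }

    // After: the test is hoisted out, and each loop body is branch-free.
    void DeformAfter( int numVerts, bool hasNormals )
    {
        if ( hasNormals )
        {
            for ( int i = 0; i < numVerts; i++ )
            {
                CopyVertexWithNormal( i );
            }
        }
        else
        {
            for ( int i = 0; i < numVerts; i++ )
            {
                CopyVertex( i );
            }
        }
    }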

@illwieckz changed the title from “some revamp on IQM and MD5 model code” to “WIP: some revamp on IQM and MD5 model code” on May 31, 2021
@illwieckz marked this pull request as a draft on May 31, 2021 22:05
@illwieckz changed the base branch from 0.52.0/sync to master on May 31, 2021 22:41
@illwieckz (Member, Author) commented on Jun 29, 2021

I rebased on current master. As a reminder of the context, this work was motivated by Unvanquished/Unvanquished#1207, when I discovered that some GPUs may not be able to process the model transformations because of some limits, in which case the engine falls back to a path that is too slow to keep the game playable.

Most of this PR rewrites a lot of code to help the compiler optimize it better, for example by reducing pointer indirections and branching (so less branch prediction uncertainty). The other part of this PR rewrites both the IQM and MD5 code to minimize the differences between their implementations, making it possible to port optimizations back and forth and to identify differences that may explain performance gaps (maybe in the future it would be possible to use the same code for both? who knows).
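A sketch of the pointer-indirection side of those rewrites; the struct layout below is illustrative, not the engine's actual tess/surf definitions:

    // Illustrative types standing in for the engine's tess and surface structures.
    struct vertex_t { float xyz[ 3 ]; };
    struct surface_t { int numVerts; vertex_t *verts; };
    struct tess_t { int numVertexes; vertex_t verts[ 4096 ]; };

    // Before: every iteration re-reads tess.numVertexes and goes through two
    // indirections, which the compiler cannot always hoist because of aliasing.
    void CopyVertsBefore( tess_t &tess, const surface_t *surf )
    {
        for ( int i = 0; i < surf->numVerts; i++ )
        {
            tess.verts[ tess.numVertexes + i ].xyz[ 0 ] = surf->verts[ i ].xyz[ 0 ];
            tess.verts[ tess.numVertexes + i ].xyz[ 1 ] = surf->verts[ i ].xyz[ 1 ];
            tess.verts[ tess.numVertexes + i ].xyz[ 2 ] = surf->verts[ i ].xyz[ 2 ];
        }
    }

    // After: the invariant address computation is hoisted into plain local
    // pointers, so the loop body only advances two pointers.
    void CopyVertsAfter( tess_t &tess, const surface_t *surf )
    {
        vertex_t *dstVert = &tess.verts[ tess.numVertexes ];
        const vertex_t *srcVert = surf->verts;

        for ( int i = 0; i < surf->numVerts; i++, dstVert++, srcVert++ )
        {
            dstVert->xyz[ 0 ] = srcVert->xyz[ 0 ];
            dstVert->xyz[ 1 ] = srcVert->xyz[ 1 ];
            dstVert->xyz[ 2 ] = srcVert->xyz[ 2 ];
        }
    }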

It's believed there is still plenty of room for optimization, given that:

  • one scene with MD5 acid tubes instead of IQM acid tubes is known to run twice as fast on some old hardware,
  • ZTM has done heavy optimizations in ioquake3 involving aggressive precomputation, and we may want to import those one day.

But this PR is already very big, and I would like to see it merged before I look at other optimizations. I basically need this to be merged so I can build on top of it.

To test this code, the r_vboModels cvar must be set to 0. Note: when r_vboModels is set to 0 there is a bug where third-person MD3 weapon models are not displayed; that is unrelated to this code, and when the non-VBO model animation path is taken because of a hardware limit (not because that cvar is disabled), they are displayed properly.

While this does not bring the twofold performance increase I was looking for, the gain is still measurable:

On the plat23 map, using the lowest preset and r_vboModels 0, I get the following on my desktop computer:

Scene: /setviewpos 2314 2124 308 52 45 (camera on alien overmind, two eggs, leech and barricade)

  • Before: 59 fps
  • After: 74 fps

Scene: /setviewpos 1893 1920 -5 0 0 (camera from alien base entry corridor, even if not everything is seen, the engine renders 6 acid tubes, 3 eggs, 1 overmind, 1 leech and 1 barricade)

  • Before: 28 fps
  • After: 38 fps

So that's already a 20~25% gain.

I remember that on a very old computer the fps went from 10 to 12 on that complex scene (while the same computer can do 40 fps when none of these non-hardware-accelerated models are present), which confirms the 20% gain. I'm still dreaming of doubling the performance, though…

@illwieckz changed the title from “WIP: some revamp on IQM and MD5 model code” to “some revamp on IQM and MD5 model code” on Jun 29, 2021
@illwieckz marked this pull request as ready for review on Jun 29, 2021 04:15
@illwieckz (Member, Author) commented on Jun 29, 2021

Note: given the comment there (#389 (comment)), the MD5 animation is now also a bit faster when done on the CPU.

Also, a variable was used before initialization.

lsalzman said:

> A 0 weight has no influence on the joint, because the influence is multiplied by 0.

This is not perfect, as it is like saving two fps when we could multiply fps by two instead. The best would be to precompute things at load time in order to prevent branching within loops. See DaemonEngine#390.

@slipher said:

    TransInit(bone); TransAddScale(entityScale, bone);

is equivalent to

    TransInitScale(entityScale, bone);
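A sketch of the precompute-at-load-time idea: drop zero weights when the model is loaded so the per-frame skinning loop never has to branch on them. The data structure and function below are hypothetical, not what DaemonEngine#390 proposes verbatim:

    #include <vector>

    // Hypothetical compacted influence, built once at load time.
    struct boneInfluence_t
    {
        int   boneIndex;
        float weight;   // non-zero by construction
    };

    // Load time: keep only the influences that actually contribute, since a
    // 0 weight has no influence on the joint.
    std::vector<boneInfluence_t> CompactInfluences(
        const int boneIndexes[ 4 ], const float weights[ 4 ] )
    {
        std::vector<boneInfluence_t> influences;

        for ( int j = 0; j < 4; j++ )
        {
            if ( weights[ j ] != 0.0f )
            {
                influences.push_back( { boneIndexes[ j ], weights[ j ] } );
            }
        }

        return influences;
    }

At run time, the skinning loop would then only iterate over the compacted list, with no weight == 0 test left inside the hot loop.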
@illwieckz (Member, Author) commented on Jul 11, 2021

I tested on an old Intel integrated GPU (Core 2 Duo L7500 + X3100 / GMA965). On this hardware, the Linux driver emulates the GPU vertex animation, which is the slowest implementation: it gets only 8 fps on the heavy plat23 alien base reference scene:

[screenshot: iqm slow on cpu]

With our own CPU alternative code (r_vboVertexSkinning 0), current master gets 13 fps:

[screenshot: iqm slow on cpu]

With this IQM revamp, 15 fps is reached:

[screenshot: iqm slow on cpu]

With this IQM revamp plus q_math inline (#505), 16 fps is reached:

[screenshot: iqm slow on cpu]

This scene (+devmap plat23 +delay 100f setviewpos 1893 1920 0 0 0) is very heavy to render; there are twelve animated buildables to process: one overmind, one leech, one barricade, six acid tubes, and three eggs.

After this PR and #505, there are two more optimization paths to explore (both can be done): turning the q_math inline functions into SSE code, and porting ZTM's optimizations from ioquake3 as described in #390 (comment).
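For the first path, a minimal sketch of what an SSE version of a q_math-style helper could look like (the function name and 4-component signature are assumptions; the actual q_math helpers mostly work on 3 components):

    #include <xmmintrin.h>

    // Sketch: c = a + s * b over 4 floats at once, the kind of multiply-add
    // the vector helpers currently do one component at a time.
    static inline void Vector4MA_SSE( const float *a, float s, const float *b, float *c )
    {
        __m128 va = _mm_loadu_ps( a );
        __m128 vb = _mm_loadu_ps( b );
        __m128 vs = _mm_set1_ps( s );

        _mm_storeu_ps( c, _mm_add_ps( va, _mm_mul_ps( vs, vb ) ) );
    }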

Note: to test this code path, one can set r_vboVertexSkinning to 0.

@illwieckz (Member, Author) commented on Jul 12, 2021

The PR is now considered ready to merge.

The proposed changes are meant to improve performance without changing the underlying logic. Apart from some computation being moved from run time to load time, there was no intention to rethink the way things are done.

The proposed changes have now been thoroughly tested on various hardware (Edit: and drivers, including the Nvidia proprietary one) and with the compilers we use for releases (GCC and Clang). I only tested on Linux, though I doubt there is anything operating-system specific here.

Review comments were addressed and suggestions applied.

Even if not everything got a comment, at this point I don't even know what would be left to review.

I see no reason to delay the merge further, so if no one has anything to say I'll merge it tomorrow.

@illwieckz merged commit 0cd380a into DaemonEngine:master on Jul 18, 2021
@illwieckz (Member, Author) commented:

People wanting to exercise this code path can set r_vboVertexSkinning to 0.

In the past I talked about r_vboModels 0, but that cvar also has unwanted effects on other things.

The exact cvar to enable the CPU path for model animation is r_vboVertexSkinning 0.
