New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gsdx sw: rcp introduce bad rounding on the grandiant #1843

Merged
merged 3 commits into from Mar 10, 2017

Conversation

Projects
None yet
6 participants
@gregory38
Contributor

gregory38 commented Feb 27, 2017

Division is slower but more accurate. Fix rendering issue on Xenosaga (batte slot)/Jak3 (skin color)

Geez need to edit the commit to link the issue.
Issue #1769

TODO

  • add comment
  • use accurate div on sw ref implementation
  • add option for sw fragment shader
  • add template parameter for VertexTrace
  • automatically enable previous options when Q is big enough
plugins/GSdx/GSRasterizer.cpp Outdated
@@ -464,7 +462,7 @@ void GSRasterizer::DrawTriangle(const GSVertexSW* vertex, const uint32* index)
ddx[1] = ddx[0].yxzw();
ddx[2] = ddx[0].xzyw();
GSVector8 _dxy01c(dxy01 * cross);
GSVector8 _dxy01c(dxy01 / cross);

This comment has been minimized.

@gregory38

gregory38 Feb 28, 2017

Contributor

Need to add comment of rcp vs div

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Feb 28, 2017

@sudonim1

The issue #551 is potentially related to inaccurate rcp operation on the VertexTrace (and likely in the fragment shader too)

But Gsdx isn't good neither.

    S/T = 3.40282347e+38 (biggest float)
    Q = 3.40282347e+38

GSdx requires to compute S/Q (and T/Q)

    1/Q gives 0 due to float error
    S * (1/Q) gives you 0 instead of 1

I looked this morning at Agner manual. And div is really slow on AMD Jaguar CPU ~40 cycles. I need to check recent AMD CPU. Intel's Sandy is 28. Recent Intel (at least Haswell and maybe IB) is 14. So rcp isn't a so good idea nowadays. It would be nice to find a good benchmark (high number of polygons + texture + ST coordinate)

@sudonim1

This comment has been minimized.

Contributor

sudonim1 commented Feb 28, 2017

If it comes down to it, it can be an option. The benchmark would just be any high poly textured 3D scene.

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Feb 28, 2017

I think for the JIT I will try to detect the CPU type. (Maybe based on AVX 1 or 2 support). Per fragment operation is much more costly than per primitive operation.

Initially I wanted to look SotC but it uses cell shading not texture.

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Feb 28, 2017

Question remain how to handle this case ? 3 divisions or do a 1.0f/q that hopefully won't be optimized to rpc. I'm afraid that 3 div will stall the pipeline. However it might handle better huge/small value.

GSVector4 q = stq0.wwww(stq1).xzww(stq2).rcpnr();

stq0 = (stq0.xyww() * q.xxxx()).xyww(stq0);
stq1 = (stq1.xyww() * q.yyyy()).xyww(stq1);
stq2 = (stq2.xyww() * q.zzzz()).xyww(stq2);

Edit: AVX512 has an extension to support a 23 bits (on 24) precision for rcp => VRCP28PS
Edit2: perf of div on newer CPU https://software.intel.com/sites/default/files/managed/28/11/Intel_32bit_FP_Divide_Throughput.png (Intel actually worked quite hard)

@mirh

This comment has been minimized.

mirh commented Feb 28, 2017

Just for the records, Jaguar might even be last AMD architecture (til tomorrow at least :s), but it was meant to be a low budget, low power one. I'm not saying it's an atom, but compromises might have been made.
EDIT: oh noes, my C2D sucks hard :c

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Feb 28, 2017

Current commit is fine. But if we replace more code. We gonna replace an reciprocal+3 mult by 3 div. Which is slower on all arch.Now I don't know the real impact of game fps.

@woj1993

This comment has been minimized.

Contributor

woj1993 commented Mar 1, 2017

@gregory38 If you got ready code, then why don't test speed impact?
PS. Gregory I don't know is it the same error I suspect this to be Z fighting but Megaman X Command Mission have issues in SW and HW modes. In SW mode it looks like small miscalculation. Can you check is it related to this?

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Mar 1, 2017

First I'm busy with real life. Then I'm still looking at all the implication. For example the fragment shader uses low accuracy operation.

rcpps(xmm0, xmm4); // 1/Q
mulps(xmm2, xmm0); // S * (1 / Q)
mulps(xmm3, xmm0);  // T * (1 / Q)

2 division will be 30-50 times slower to execute than the above code. So I think I will need to add an option.

Third we need to find a good benchmark. If the game doesn't hit the code path => no speed impact. If the game mostly hit the code path => speed impact.

PS. Gregory I don't know is it the same error I suspect this to be Z fighting but Megaman X Command Mission have issues in SW and HW modes. In SW mode it looks like small miscalculation. Can you check is it related to this?

It is much more complicated. The data are coming from the VU. The data are already rounded by VU, GSdx will round them further. Let's first fixes the issue related to rcp vs div.

@gregory38 gregory38 added the WIP label Mar 1, 2017

@gregory38 gregory38 force-pushed the greg/gsdx-sw-rcp-rounding branch Mar 3, 2017

@gregory38 gregory38 removed the WIP label Mar 3, 2017

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Mar 3, 2017

@sudonim1 @turtleli Could you review the code.

Not implemented

  • SW renderer: option to select div operation. So far my testcase got a constant Q so the div is done on the vertex level.
  • HW renderer: Optimization of constant Q on the CPU to avoid float overflow on GPU. Bigger code updates, so far it is required by 1 game which isn't fully fixed on the SW renderer. So I propose to postpone it.

@gregory38 gregory38 added this to the Release 1.6 milestone Mar 3, 2017

@sudonim1

I thiiiiink this is okay? I'm not in depth enough on the software renderer to be 100% confident but at least nothing's obviously wrong.

plugins/GSdx/GSRendererSW.cpp Outdated
}
// skip per pixel division if q is constant
// Note: the Q division was done GSRendererSW::ConvertVertexBuffer
gd.sel.fst |= (m_vt.m_eq.q || primclass == GS_SPRITE_CLASS);

This comment has been minimized.

@sudonim1

sudonim1 Mar 4, 2017

Contributor

Doesn't check for 1.0 Q unlike GSRendererSW::Draw? It should be a no-op so I guess it won't affect emulation other than speed.

This comment has been minimized.

@gregory38

gregory38 Mar 5, 2017

Contributor

Because in GSRendererSW::Draw the division by 1 could replaced by a nop. So you don't actually need to do it.
So whatever the Q value, if Q is constant you could avoid the division in the Fragment Shader.

This comment has been minimized.

@sudonim1

sudonim1 Mar 5, 2017

Contributor

I was just having trouble following the slightly different logic for pre-division and flat ST for texture mapping. The code's correct. Flat ST handling if not mipmapping and (Q is constant or drawing sprites), pre-division with the same conditions but skips q=1 for efficiency, got it.

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Mar 5, 2017

I need to benchmark the speed impact of the pre Q division. Code is likely slower but I don't think we will do the pre division often.

I will push the first commit in master because it fixes at least 2 games. And there is no way to have a faster solution.

plugins/GSdx/GSRendererSW.cpp Outdated
}
// skip per pixel division if q is constant
// Note: the Q division was done GSRendererSW::ConvertVertexBuffer
gd.sel.fst |= (m_vt.m_eq.q || primclass == GS_SPRITE_CLASS);

This comment has been minimized.

@sudonim1

sudonim1 Mar 5, 2017

Contributor

I was just having trouble following the slightly different logic for pre-division and flat ST for texture mapping. The code's correct. Flat ST handling if not mipmapping and (Q is constant or drawing sprites), pre-division with the same conditions but skips q=1 for efficiency, got it.

@turtleli

I don't quite follow part of the last commit (const Q div commit), though that's because I'm unfamiliar with the GS rendering stuff and I didn't dig into the code. Otherwise it seems fine to me.

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Mar 5, 2017

@turtleli
https://en.wikipedia.org/wiki/Texture_mapping#Perspective_correctness

My understanding is that S == u/z, T == v/z, Q == 1/z. In 2d rendering, u and v are interpolated from the vertex position. For correct 3d rendering, you need to take into account the depth value. So the GS will compute interpolation of S / interpolation of Q. If Q is uniform on the primitive, the interpolation of Q == Q. And alpha * interpolation of x == interpolation of alpha * x.

Remain a strange case, the sprite case. Sprites are 2d rectangle without depth. So it doesn't make sense to do a perpective correction. However sudonim1 is likely right when he said that Q can be set to any values.

My latest commit only moves code around. ST are normalized coordinate (without wrapping range from 0 to 1), so you need to multiply the coordinate by the size of the texture. And then do the division by Q. However if ST are huge. ST * size is infinite and ST * size / Q is bad. My commit tries to do the Q division first. So (ST / Q) * size gives a standard number. My commit also use / x instead of * 1/x to avoid accuracy issue but the instruction is slower.

@prafullpcsx2

This comment has been minimized.

Contributor

prafullpcsx2 commented Mar 5, 2017

This seems to fix #1684

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Mar 5, 2017

@prafullpcsx2 thanks for the info. I pushed the first commit of this PR to master, could you tell me if it is enough. If not, could you share with us some gsdump, the one from the issue report are expired. Long story short, I found some bug in my PR on the sprite handling (but less buggy than master by pure luck). I want to be sure that I understand the expected behavior correctly.

@prafullpcsx2

This comment has been minimized.

Contributor

prafullpcsx2 commented Mar 5, 2017

@gregory38 First commit of this PR doesn't fix it. I will upload the gsdump later. I do have a block dump ready though incase you want to use it. http://www.filedropper.com/ga_1

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Mar 5, 2017

@prafullpcsx2 ok. I think I have a quick fix that could explain some bugs. I will push it soon.

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Mar 5, 2017

@prafullpcsx2 yes, you're right. The game uses huge float.

 Vertex Trace: float overflow detected ! min 3.402823e+38 max 3.402823e+38

Now, I know that at least 2 games are affected. Unfortunately I need to redo/fix my code to handle the Q properly for sprite.

@gregory38 gregory38 force-pushed the greg/gsdx-sw-rcp-rounding branch Mar 5, 2017

@gregory38

This comment has been minimized.

Contributor

gregory38 commented Mar 5, 2017

edit: ~~~Please don't test the branch now, I break ATV again.~~~ Just silly code, it is good now.

gregory38 added some commits Feb 28, 2017

gsdx sw: use accurate division for the C reference implementation
Speed isn't important here. It would allow to compare the rendering with
the JIT implementation. If it is necessary we could an option for the JIT.
gsdx: handle float overflow on Q in vertex trace
Replace the fast reciprocal with a slower division when we detect a too big Q value.

Improve #551, #1684
gsdx sw: do const q division in ConvertVertexBuffer
It allow to do the division before the size multiplication
It avoid a float overflow if T is too big.
Old behavior: (T * size) / Q
New behavior: (T / Q) * size

Performance Note:
* Rcp was replaced by a slow division (more accurate)
* At least we avoid a 2nd loop on the vertex buffer

It helps on Pro Soccer Club and Galerians Ash rendering

Tric Note:
SPRITE must be handled differently because the 'q' of first vertex could
be invalid

@gregory38 gregory38 force-pushed the greg/gsdx-sw-rcp-rounding branch from 3333f0f to 9a9c63f Mar 10, 2017

@gregory38 gregory38 merged commit 6d6ed1a into master Mar 10, 2017

4 checks passed

continuous-integration/appveyor/branch AppVeyor build succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@gregory38 gregory38 deleted the greg/gsdx-sw-rcp-rounding branch Apr 24, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment