Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve approximate xfloat #6894

Merged
merged 1 commit into from
Dec 22, 2019
Merged

Improve approximate xfloat #6894

merged 1 commit into from
Dec 22, 2019

Conversation

plappermaul
Copy link
Contributor

@plappermaul plappermaul commented Oct 26, 2019

  • Disable denormals for SPU threads
  • add clamping helper
  • rewrite multiply_zero()
  • make use of FMA if possible

rpcs3/Emu/CPU/CPUThread.cpp Outdated Show resolved Hide resolved
rpcs3/Emu/Cell/SPURecompiler.cpp Outdated Show resolved Hide resolved
@Xcedf
Copy link

Xcedf commented Oct 26, 2019

Yes, PR fixes #6616, and works with other games so far, but why removing Default path?

@plappermaul
Copy link
Contributor Author

Yes, PR fixes #6616, and works with other games so far, but why removing Default path?

default path restored in updated PR

@Xcedf
Copy link

Xcedf commented Oct 27, 2019

I'm not sure this is good place for requests but anyway, it will be nice if approximate version of FS will be added, it used by Just Cause 2 to render graphichs and by Uncharted 1 for grenade sequence more info is here in my old issue #4478.
For me it's not a problem to build for myself and switch between different instructios but i'm sure for average user use Accurate Xfloat or be even with severe bugs is quite eshausting.
based on your changes i tried smth and it worked:
const auto a = eval(clamp_smax(get_vr<f32[4]>(op.ra)));
const auto b = eval(clamp_smax(get_vr<f32[4]>(op.rb)));
set_vr(op.rt, a - b);

@plappermaul
Copy link
Contributor Author

Hopefully Windows integration works again now...

@Xcedf: Regarding FS() problems. We want to make approximate xfloat as good as possible. So we should implement every needed fix. To keep overhead as small as possible I want to ask you to test the following things:

FS(): only clamp one of the two variables. If that does not help, clamp the result. Maybe we can live with only on fix.

FM(): remove clamping. FMA(), FNMS() and FMS() do not need it, sow why should FM()? Maybe we are fine with the new multiply_zero() function.

@Xcedf
Copy link

Xcedf commented Oct 27, 2019

@plappermaul don't have time to test FM today, will do tomorrow, as for FS:
clamping only a - not working
clamping only b - working
clamping res - not working

@plappermaul
Copy link
Contributor Author

clamping for FS() operand b added to patch.

@Xcedf
Copy link

Xcedf commented Oct 28, 2019

@plappermaul unlike FS instruction FM is used by great number of games, tested some, for now found only one small difference, in TLoU of course:
non-clamped FM only: hair is missing
1
non-clamped FM + FMA: everything fine, but slightly slower
2
clamped FM only: fine and fast
3
Of course this the result of custom build on, normal build with whole Approx xfloat instruction set everything should be fine as on second picture, but still not sure that clap function is fine to remove

@plappermaul
Copy link
Contributor Author

@Xcedf thanks for your patience and help. I rearranged the code a little bit so we can make use of hardware FMA in the xfloat code path. Maybe you can run a before/after performance comparison with the latest PR code.

Other readers of this thread are invited to give numbers too.

@plappermaul
Copy link
Contributor Author

Hmmmmmm ... I just recognize that the old FM() code clamps the wrong way:

const auto fmax_m = eval((sign_a ^ sign_b) | 0x7fffffff);
const auto clamp = select(smod_m > 0x7f7fffff, bitcast<f32[4]>(fmax_m), m);

fmax_m shoud be 7f7fffff. That leaves the question if that is REALLY needed ....

@Xcedf
Copy link

Xcedf commented Oct 29, 2019

Did a few tests framerats are pretty much the same, found a little difference will do more tests tomorrow to tell exactly.

@plappermaul
Copy link
Contributor Author

Good to hear. In between I thought of another version of the new xmuladd() function. As an additional test you could give this a try:

value_t<f32[4]> xmuladd(value_t<f32[4]> a, value_t<f32[4]> b, value_t<f32[4]> c)
{
  const auto ma = sext<s32[4]>(fcmp_uno(fsplat<f32[4]>(0.) != a));
  const auto mb = sext<s32[4]>(fcmp_uno(fsplat<f32[4]>(0.) != b));
  const auto ca = eval(bitcast<f32[4]>(bitcast<s32[4]>(a) & mb));
  const auto cb = eval(bitcast<f32[4]>(bitcast<s32[4]>(b) & ma));
  return eval(fmuladd(ca, cb, c));
}

While rewriting this all we might squeeze some more fps out of it.

@Xcedf
Copy link

Xcedf commented Oct 30, 2019

The result is identical, and about the difference i've found previously: i think i've compiled smth wrong, PR works good, no regressions.

@plappermaul
Copy link
Contributor Author

Thanks. Will do some additional checks myself and post a final update to the PR later on.

@plappermaul
Copy link
Contributor Author

I'm finally fine with the patch.

// FMA favouring zeros
value_t<f32[4]> xmuladd(value_t<f32[4]> a, value_t<f32[4]> b, value_t<f32[4]> c)
{
const auto ma = eval(sext<s32[4]>(fcmp_uno(a != fsplat<f32[4]>(0.))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it essentially a no-op here? Unordered comparison with zero. Maybe I'm confusing something, I don't get what cases it affects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks if the variable is not zero. We can distinguish 3 cases here:

  • abs(a) < 0x7fffff => mask = 0 (because auf denormals deactivation)
  • abs(a) between 0x80000 and 0x7f7fffffff => mask = ffffffff
  • NaN => mask = fffffffff (in case someone feeds large constant values into the function)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but why would it affect multiplication result?

const auto mb = eval(sext<s32[4]>(fcmp_uno(b != fsplat<f32[4]>(0.))));
const auto ca = eval(bitcast<f32[4]>(bitcast<s32[4]>(a) & mb));
const auto cb = eval(bitcast<f32[4]>(bitcast<s32[4]>(b) & ma));
return eval(fmuladd(ca, cb, c));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this is an opportunistic FMA which means it may produce different results on different CPU arch. It's used on Accurate Xfloat path because it's not a problem when using doubles to store extended range floats, but it doesn't seem so for Approximate path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could give an example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you can set CPU: ivybridge or similar and test. I'm sure the results will be different.

@Nekotekina
Copy link
Member

Rebased PR with some changes, needs testing.

@Xcedf
Copy link

Xcedf commented Dec 21, 2019

I did some more tests recently with UC1 with FS only b clamped some dead npcs still can act wierd, while other remain dead, but with both a and b clamped everything's fine

- Disable denormals for SPU threads
- Add clamping helpers
@Nekotekina Nekotekina merged commit a36f049 into RPCS3:master Dec 22, 2019
elad335 added a commit to elad335/rpcs3 that referenced this pull request Dec 22, 2019
@Xcedf Xcedf mentioned this pull request Dec 23, 2019
Nekotekina pushed a commit that referenced this pull request Dec 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants