-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve approximate xfloat #6894
Conversation
plappermaul
commented
Oct 26, 2019
•
edited
Loading
edited
- Disable denormals for SPU threads
- add clamping helper
- rewrite multiply_zero()
- make use of FMA if possible
|
Yes, PR fixes #6616, and works with other games so far, but why removing Default path? |
default path restored in updated PR |
|
I'm not sure this is good place for requests but anyway, it will be nice if approximate version of FS will be added, it used by Just Cause 2 to render graphichs and by Uncharted 1 for grenade sequence more info is here in my old issue #4478. |
|
Hopefully Windows integration works again now... @Xcedf: Regarding FS() problems. We want to make approximate xfloat as good as possible. So we should implement every needed fix. To keep overhead as small as possible I want to ask you to test the following things: FS(): only clamp one of the two variables. If that does not help, clamp the result. Maybe we can live with only on fix. FM(): remove clamping. FMA(), FNMS() and FMS() do not need it, sow why should FM()? Maybe we are fine with the new multiply_zero() function. |
|
@plappermaul don't have time to test FM today, will do tomorrow, as for FS: |
|
clamping for FS() operand b added to patch. |
|
@plappermaul unlike FS instruction FM is used by great number of games, tested some, for now found only one small difference, in TLoU of course: |
|
@Xcedf thanks for your patience and help. I rearranged the code a little bit so we can make use of hardware FMA in the xfloat code path. Maybe you can run a before/after performance comparison with the latest PR code. Other readers of this thread are invited to give numbers too. |
|
Hmmmmmm ... I just recognize that the old FM() code clamps the wrong way: fmax_m shoud be 7f7fffff. That leaves the question if that is REALLY needed .... |
|
Did a few tests framerats are pretty much the same, found a little difference will do more tests tomorrow to tell exactly. |
|
Good to hear. In between I thought of another version of the new xmuladd() function. As an additional test you could give this a try: While rewriting this all we might squeeze some more fps out of it. |
|
The result is identical, and about the difference i've found previously: i think i've compiled smth wrong, PR works good, no regressions. |
|
Thanks. Will do some additional checks myself and post a final update to the PR later on. |
|
I'm finally fine with the patch. |
| // FMA favouring zeros | ||
| value_t<f32[4]> xmuladd(value_t<f32[4]> a, value_t<f32[4]> b, value_t<f32[4]> c) | ||
| { | ||
| const auto ma = eval(sext<s32[4]>(fcmp_uno(a != fsplat<f32[4]>(0.)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it essentially a no-op here? Unordered comparison with zero. Maybe I'm confusing something, I don't get what cases it affects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This checks if the variable is not zero. We can distinguish 3 cases here:
- abs(a) < 0x7fffff => mask = 0 (because auf denormals deactivation)
- abs(a) between 0x80000 and 0x7f7fffffff => mask = ffffffff
- NaN => mask = fffffffff (in case someone feeds large constant values into the function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but why would it affect multiplication result?
| const auto mb = eval(sext<s32[4]>(fcmp_uno(b != fsplat<f32[4]>(0.)))); | ||
| const auto ca = eval(bitcast<f32[4]>(bitcast<s32[4]>(a) & mb)); | ||
| const auto cb = eval(bitcast<f32[4]>(bitcast<s32[4]>(b) & ma)); | ||
| return eval(fmuladd(ca, cb, c)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this is an opportunistic FMA which means it may produce different results on different CPU arch. It's used on Accurate Xfloat path because it's not a problem when using doubles to store extended range floats, but it doesn't seem so for Approximate path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could give an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, you can set CPU: ivybridge or similar and test. I'm sure the results will be different.
|
Rebased PR with some changes, needs testing. |
|
I did some more tests recently with UC1 with FS only b clamped some dead npcs still can act wierd, while other remain dead, but with both a and b clamped everything's fine |
- Disable denormals for SPU threads - Add clamping helpers


