Performance improvement of Base.Math.expm1 by using Paterson-Stockmayer #40870
Conversation
A closer look at the difference between …
So, the performance difference seems to stem from the difference in computational effort of …
Is there a way to get more information about what went wrong with the tests? A Documenter.jl error suggests it is a doctest failure?
The error appears to happen elsewhere:
What do you mean? The stacktrace is: … Looks pretty relevant to the PR?
base/special/exp.jl (outdated)
-    return evalpoly(x, (1.0f0, 0.6931472f0, 0.2402265f0,
-                        0.05550411f0, 0.009618025f0,
-                        0.0013333423f0, 0.00015469732f0, 1.5316464f-5))
+    return evalpoly_ps8(x, (1.0f0, 0.6931472f0, 0.2402265f0,
These are degree 7, no?
Yes. Thanks. Fixed. I will provide some comparisons for the kernel functions as well, to see if it makes sense to add zero coefficients as I did now.
I added a zero coefficient when evaluating polynomials of degree 7, rather than treating degree 7 separately. The compiler seems to take care of it, since nothing seems to be lost in terms of performance:

@inline function evalpoly_ps8(x,c)
xx = x * x # x^2
xxx = xx * x # x^3
X_987 = muladd(c[9], xx, muladd(c[8], x, c[7])) # c[7]+c[8]*x+c[9]*x^2
X_654 = muladd(xx, c[6], muladd(c[5], x, c[4])) # c[4]+c[5]*x+c[6]*x^2
Y_1 = muladd(c[3], xx, muladd(c[2], x, c[1])) # c[1]+c[2]*x+c[3]*x^2
Y = muladd(xxx, muladd(xxx, X_987, X_654), Y_1) # Y_1 + x^3*(X_654 + x^3*X_987)
return Y
end
@inline function evalpoly_ps7(x,c)
xx = x * x # x^2
xxx = xx * x # x^3
X_87 = muladd(c[8], x, c[7]) # c[7]+c[8]*x
X_654 = muladd(xx, c[6], muladd(c[5], x, c[4])) # c[4]+c[5]*x+c[6]*x^2
Y_1 = muladd(c[3], xx, muladd(c[2], x, c[1])) # c[1]+c[2]*x+c[3]*x^2
Y = muladd(xxx, muladd(xxx, X_87, X_654), Y_1) # Y_1 + x^3*(X_654 + x^3*X_87)
return Y
end
@inline function expb_kernel_new_ps7(x::Float32)
return evalpoly_ps7(x, (1.0f0, 0.6931472f0, 0.2402265f0,
0.05550411f0, 0.009618025f0,
0.0013333423f0, 0.00015469732f0, 1.5316464f-5))
end
@inline function expb_kernel_new_ps8(x::Float32)
return evalpoly_ps8(x, (1.0f0, 0.6931472f0, 0.2402265f0,
0.05550411f0, 0.009618025f0,
0.0013333423f0, 0.00015469732f0, 1.5316464f-5, 0.0f0)) # padded zero coefficient
end
@inline function expb_kernel_old(x::Float32)
return evalpoly(x, (1.0f0, 0.6931472f0, 0.2402265f0,
0.05550411f0, 0.009618025f0,
0.0013333423f0, 0.00015469732f0, 1.5316464f-5))
end

Gives:
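For a reproducible comparison, a benchmark along these lines can be used (a sketch, assuming BenchmarkTools.jl and hypothetical uniform inputs on the kernel's reduced argument range):

using BenchmarkTools

xs = rand(Float32, 10_000) .- 0.5f0  # hypothetical inputs in [-0.5, 0.5)

@btime sum(expb_kernel_old, $xs)
@btime sum(expb_kernel_new_ps7, $xs)
@btime sum(expb_kernel_new_ps8, $xs)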
the old and new functions return different values
This is due to rounding errors. The result still seems to be within machine precision in a relative sense, so I think this should be tolerated:
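One way to check this (a sketch; it assumes the kernel definitions above and uses a BigFloat exp2 reference, since the coefficients above match the Taylor series of 2^x):

# Maximum relative error against a BigFloat reference, in units of eps(Float32)
relerr(f, x) = abs(big(f(x)) - exp2(big(x))) / exp2(big(x))
xs = range(-0.5f0, 0.5f0; length=10_001)
maximum(x -> relerr(expb_kernel_old, x), xs) / eps(Float32)
maximum(x -> relerr(expb_kernel_new_ps8, x), xs) / eps(Float32)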
from a recent Slack #maths thread:
If this is a consistent performance improvement, it's an OK change from an accuracy perspective.
That being said, I would much prefer this PR if it modified evalpoly.
Me too. I think the advantage of this approach could be kept in a Horner-style scheme if the Horner evaluation is only done for the higher coefficients (see the sketch below): … Note: the formula can be evaluated with 2 muls and 8 fmas, in contrast to the current implementation of … However, if I were to PR …
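For illustration, a hypothetical evalpoly_ps that generalizes evalpoly_ps8 above to any coefficient count that is a multiple of 3 (a sketch; the recurrence is Horner in x^3, with the degree-2 chunks playing the role of coefficients):

# Paterson-Stockmayer as Horner in x^3 over degree-2 chunks.
# Pad c with zeros so that length(c) is a multiple of 3.
@inline function evalpoly_ps(x, c)
    n = length(c)
    xx = x * x
    xxx = xx * x
    y = muladd(c[n], xx, muladd(c[n-1], x, c[n-2]))  # highest chunk
    for i in (n-3):-3:3  # Horner step: y = y*x^3 + next chunk
        y = muladd(xxx, y, muladd(c[i], xx, muladd(c[i-1], x, c[i-2])))
    end
    return y
end

For a 9-coefficient tuple this performs the same operations as evalpoly_ps8 above.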
This is fine for a PR to …
This is really interesting. I've hard-coded the degree-21 version of this, and can confirm it's noticeably faster (it lets me do …).
Okay. Do you mean the PS approach, or the approach where we precompute …? Here is PS8 in formulas (see the reconstruction below). It is still not entirely clear to me why this is faster. Earlier I thought the advantage over Horner evaluation was that …
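Reconstructed from the evalpoly_ps8 code above, the PS8 evaluation order is:

p(x) = (c1 + c2*x + c3*x^2) + x^3*((c4 + c5*x + c6*x^2) + x^3*(c7 + c8*x + c9*x^2))

i.e. two multiplications (for x^2 and x^3) and eight fmas, with the three degree-2 chunks mutually independent.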
Yeah. The code is … I'm pretty sure the reason this is faster is that it can use more execution ports, due to being more parallelizable.
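Concretely (a back-of-the-envelope sketch, assuming a CPU where fma has a few cycles of latency but several fma units): Horner for degree 8 is one serial chain of eight dependent fmas, so the critical path is about eight fma latencies. In PS8, once x^2 and x^3 are formed, the chunk evaluations X_987, X_654 and Y_1 are mutually independent, and only the final two combining fmas are serial, so the critical path is roughly two multiplies plus four fma latencies, even though PS executes two extra multiplies in total.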
So what would be needed to get this type of code into …? Note that PS is actually a generalization of the Horner scheme if we see …
The error analysis is not too hard here. For polynomials, there are two cases for numerical accuracy. If the sequence …
Why? I understand constant coefficients and generated code can lead to further optimization. Why would there be no advantage for arbitrary Float variables? |
Good point. … just works as is. The last thing is to benchmark, to see if we want to force …
The code doesn't run for me. Shouldn't …
Oops. It should actually be …
According to my tests, we should switch to PS at degree 8. I had some difficulties keeping the compiler from computing the result at compile time, so I am not completely sure this is correct. Messy code here: https://pastebin.com/yMep0Ywj Strange thing: the difference between the computed values for …
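One standard trick for that (a sketch, assuming BenchmarkTools.jl): pass the argument through a Ref and interpolate it, so the input is not a compile-time constant the call can be folded on:

using BenchmarkTools
c = ntuple(_ -> rand(), 9)  # hypothetical degree-8 coefficient tuple
x = Ref(rand())             # Ref hides the value from constant folding
@btime evalpoly($x[], $c)
@btime evalpoly_ps8($x[], $c)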
Interesting. The accuracy issues here are bigger than I would have thought. Part of the problem was that I was generating …
On second thought, I think there is a bug somewhere in the generated code. With degree 8 we get this generated code: … It contains both …
Good catch. Here's an actually correct version that gives 2e-17 error.
Now the only problem is that it appears this is now slower, at least for degree 9 :( I think it's because I'm adding a multiply by zero that I could get rid of.
On many CPUs, …
It's quite hard for me to read the code. How about separating this out into its own function? In fact, it is just Horner evaluation with the appropriate pX as coefficients.
Thanks, yes. We figured this out later in the thread. This thread is a bit long, and it would be better in a separate …
Closing in favor of #40919
This is a performance improvement of Base.Math.expm1 by using the Paterson-Stockmayer scheme for evaluation of polynomials of degree 8.

Performance MWE:
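A performance MWE of this kind can be written as follows (a sketch, assuming BenchmarkTools.jl; the inputs are hypothetical and the original snippet may differ):

using BenchmarkTools
xs = 2 .* rand(Float32, 10^4) .- 1  # hypothetical inputs in [-1, 1)
@btime sum(expm1, $xs)              # compare on master and on this branch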
Gives:
The reason it is faster is discussed here: https://discourse.julialang.org/t/fmas-and-poly-eval/61421 and #40867
For discussion:
- Further testing is needed to make sure it also makes sense for the expb_kernel/evalpoly in special/exp.jl.
- This could also go into evalpoly itself, but it would be more intrusive and raise more difficult completeness questions (for which poly degrees is it better, etc.).