Conversation

@jarlebring (Contributor) commented May 19, 2021

This is a performance improvement of Base.Math.expm1, obtained by using the Paterson-Stockmeyer scheme for evaluating polynomials of degree 8.

Performance MWE:

using BenchmarkTools
import Base.Math.exthorner;

# Hardcoded version of Paterson-Stockmeyer for polynomials of degree 8
@inline function evalpoly_ps8(x,c)
    xx = x * x   # x^2
    xxx = xx * x  # x^3
    X_987 = muladd(c[9], xx, muladd(c[8], x, c[7]))
    X_654 = muladd(xx, c[6], muladd(c[5], x, c[4]))
    Y_1 = muladd(c[3], xx, muladd(c[2], x, c[1]))
    Y = muladd(xxx, muladd(xxx, X_987, X_654), Y_1)
    return Y
end


@inline function expm1_small_old(x::Float64)
    p = evalpoly(x, (0.16666666666666632, 0.04166666666666556, 0.008333333333401227,
                         0.001388888889068783, 0.00019841269447671544, 2.480157691845342e-5,
                         2.7558212415361945e-6, 2.758218402815439e-7, 2.4360682937111612e-8))
    p2 = exthorner(x, (1.0, .5, p))
    return fma(x, p2[1], x*p2[2])
end


@inline function expm1_small_new(x::Float64)
    p = evalpoly_ps8(x, (0.16666666666666632, 0.04166666666666556, 0.008333333333401227,
                         0.001388888889068783, 0.00019841269447671544, 2.480157691845342e-5,
                         2.7558212415361945e-6, 2.758218402815439e-7, 2.4360682937111612e-8))
    p2 = exthorner(x, (1.0, .5, p))
    return fma(x, p2[1], x*p2[2])
end

Gives:

julia> expm1_small_new(0.3)-expm1_small_old(0.3)
0.0
julia> x=0.3
julia> @btime expm1_small_new($x)
  8.958 ns (0 allocations: 0 bytes)
julia> @btime expm1_small_old($x)
  10.778 ns (0 allocations: 0 bytes)
julia> @btime Base.Math.expm1_small($x) # Same as "old"
  10.769 ns (0 allocations: 0 bytes)

The reason it is faster is discussed here: https://discourse.julialang.org/t/fmas-and-poly-eval/61421 and #40867

For discussion:

  • Further testing is needed to make sure it also makes sense for the expb_kernel.
  • Does the same transformation make sense for other evalpoly calls in special/exp.jl?
  • Another option is to put the PS approach into evalpoly itself, but that would be more intrusive and raise harder completeness questions (for which polynomial degrees is it better, etc.).
  • Since this is Base, more roundoff-error consideration may be needed? Experimental?

@jarlebring (Contributor, Author)

A closer look at the difference between evalpoly_ps8 and evalpoly under code_native:

  • evalpoly: 8 fma operations (vfmadd213sd)
  • evalpoly_ps8: 6 fma operations and 2 vmulsd

So the performance difference seems to stem from the difference in computational cost of vfmadd213sd and vmulsd. The advantage of PS would then come from the fact that PS explicitly builds x^2 and x^3. Can the Horner scheme be rephrased to do the same? Not obviously, at least. If it were possible, the improvement could go into evalpoly.
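
For reference, a minimal way to reproduce this comparison, assuming the MWE definitions above are loaded (the syntax keyword and exact output depend on Julia version and CPU):

using InteractiveUtils  # for code_native

# Compare the scalar instruction mix of the two schemes;
# count vfmadd213sd vs. vmulsd in the output.
code_native(evalpoly_ps8, (Float64, NTuple{9,Float64}); syntax=:intel)
code_native(Base.evalpoly, (Float64, NTuple{9,Float64}); syntax=:intel)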

@jarlebring (Contributor, Author)

Is there a way to get more information about what went wrong with the tests? A Documenter.jl error suggests it is a doctest?

@jarlebring changed the title from "Performance improvement of Base.Math.exp by using Paterson-Stockmeyer" to "Performance improvement of Base.Math.expm1 by using Paterson-Stockmeyer" on May 19, 2021
@JeffreySarnoff (Contributor)

The error appears to happen elsewhere:
ERROR: LoadError: Failed to precompile Documenter [e30172f5-a6a5-5a46-863b-614d45cd2de4]

@KristofferC (Member)

The error appears to happen elsewhere:

What do you mean? The stacktrace is:

ERROR: LoadError: BoundsError: attempt to access NTuple{8, Float32} at index [9]
Stacktrace:
  [1] getindex(t::Tuple, i::Int64)
    @ Base ./tuple.jl:29
  [2] evalpoly_ps8
    @ ./special/exp.jl:75 [inlined]
  [3] expb_kernel
    @ ./special/exp.jl:109 [inlined]
  [4] exp_impl
    @ ./special/exp.jl:292 [inlined]
  [5] exp10(x::Float16)
    @ Base.Math ./special/exp.jl:305
  [6] (::Parsers.var"#7#8")(x::Int64)
    @ Parsers ./none:0
  [7] iterate
    @ ./generator.jl:47 [inlined]
  [8] collect(itr::Base.Generator{UnitRange{Int64}, Parsers.var"#7#8"})
    @ Base ./array.jl:710
  [9] top-level scope
    @ /buildworker/worker/package_linux64/build/doc/deps/packages/Parsers/2MBHI/src/floats.jl:385
 [10] include(mod::Module, _path::String)
    @ Base ./Base.jl:389
 [11] include(x::String)
    @ Parsers /buildworker/worker/package_linux64/build/doc/deps/packages/Parsers/2MBHI/src/Parsers.jl:1
 [12] top-level scope
    @ /buildworker/worker/package_linux64/build/doc/deps/packages/Parsers/2MBHI/src/Parsers.jl:737
 [13] include
    @ ./Base.jl:389 [inlined]
 [14] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::String)
    @ Base ./loading.jl:1232
 [15] top-level scope
    @ none:1
 [16] eval
    @ ./boot.jl:369 [inlined]
 [17] eval(x::Expr)
    @ Base.MainInclude ./client.jl:453
 [18] top-level scope
    @ none:1
in expression starting at /buildworker/worker/package_linux64/build/doc/deps/packages/Parsers/2MBHI/src/floats.jl:385
in expression starting at /buildworker/worker/package_linux64/build/doc/deps/packages/Parsers/2MBHI/src/Parsers.jl:1
ERROR: LoadError: Failed to precompile Parsers [69de0a69-1ddd-5017-9359-2bf0b02dc9f0] to /buildworker/worker/package_linux64/build/doc/deps/compiled/v1.7/Parsers/jl_7nEWAM.

Looks pretty relevant to the PR?

- return evalpoly(x, (1.0f0, 0.6931472f0, 0.2402265f0,
-                     0.05550411f0, 0.009618025f0,
-                     0.0013333423f0, 0.00015469732f0, 1.5316464f-5))
+ return evalpoly_ps8(x, (1.0f0, 0.6931472f0, 0.2402265f0,
Member:

These are degree 7, no?

Contributor (Author):

Yes. Thanks. Fixed. I will provide some comparisons for the kernel functions as well, to see whether it makes sense to add zero coefficients like I did now.

@jarlebring (Contributor, Author) commented May 19, 2021

I added a zero coefficient when evaluating polynomials of degree 7, rather than treating degree 7 separately. The compiler seems to take care of it, since nothing seems to be lost in terms of performance:

@inline function evalpoly_ps8(x,c)
    xx = x * x   # x^2
    xxx = xx * x  # x^3
    X_987 = muladd(c[9], xx, muladd(c[8], x, c[7]))
    X_654 = muladd(xx, c[6], muladd(c[5], x, c[4]))
    Y_1 = muladd(c[3], xx, muladd(c[2], x, c[1]))
    Y = muladd(xxx, muladd(xxx, X_987, X_654), Y_1)
    return Y
end

@inline function evalpoly_ps7(x,c)
    xx = x * x    # x^2
    xxx = xx * x  # x^3
    X_87 = muladd(c[8], x, c[7]) # c[7]+c[8]*x
    X_654 = muladd(xx, c[6], muladd(c[5], x, c[4])) # c[4]+c[5]*x+c[6]*x^2
    Y_1 = muladd(c[3], xx, muladd(c[2], x, c[1]))   # c[1]+c[2]*x+c[3]*x^2
    Y = muladd(xxx, muladd(xxx, X_87, X_654), Y_1)
    return Y
end


@inline function expb_kernel_new_ps7(x::Float32)
    return evalpoly_ps7(x, (1.0f0, 0.6931472f0, 0.2402265f0,
                            0.05550411f0, 0.009618025f0,
                            0.0013333423f0, 0.00015469732f0, 1.5316464f-5))
end

@inline function expb_kernel_new_ps8(x::Float32)
    return evalpoly_ps8(x, (1.0f0, 0.6931472f0, 0.2402265f0,
                            0.05550411f0, 0.009618025f0,
                            0.0013333423f0, 0.00015469732f0, 1.5316464f-5, 0))
end


@inline function expb_kernel_old(x::Float32)
    return evalpoly(x, (1.0f0, 0.6931472f0, 0.2402265f0,
                            0.05550411f0, 0.009618025f0,
                            0.0013333423f0, 0.00015469732f0, 1.5316464f-5))
end

Gives:

julia> x=0.3f0
julia> @btime expb_kernel_old(x)
  10.051 ns (0 allocations: 0 bytes)
7.9894247f0
julia> @btime expb_kernel_new_ps8(x)
  9.517 ns (0 allocations: 0 bytes)
7.9894247f0
julia> @btime expb_kernel_new_ps7(x)
  9.523 ns (0 allocations: 0 bytes)
7.9894247f0

@JeffreySarnoff (Contributor) commented May 19, 2021

The old and new functions return different values:

julia> x = 0.1f0;

julia> expb_kernel_old(x)
1.0717734f0

julia> expb_kernel_new_ps7(x)
1.0717735f0

julia> expb_kernel_new_ps8(x)
1.0717735f0

@jarlebring (Contributor, Author)

This is due to rounding errors. The result still seems to be within machine precision in a relative sense, so I think this should be tolerated:

julia> c = [1.0f0, 0.6931472f0, 0.2402265f0,
            0.05550411f0, 0.009618025f0,
            0.0013333423f0, 0.00015469732f0, 1.5316464f-5];
julia> exct=evalpoly(big(0.1f0),big.(c));
julia> (expb_kernel_new_ps7(0.1f0)-exct)/exct
6.085755593545383320731933637030566010151359188161734553822538617055837342305501e-08
julia> eps(Float32)
1.1920929f-7
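
The same check expressed in ULPs (a sketch; ulp_error is a hypothetical helper, using the common convention of dividing the error by the float spacing at the reference value):

# Error of a Float32 result in ULPs, measured against a BigFloat reference.
ulp_error(y::Float32, ref::BigFloat) = Float64(abs(big(y) - ref) / big(eps(Float32(ref))))

ulp_error(expb_kernel_new_ps7(0.1f0), exct)  # ~0.55 ULP for the value above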

@JeffreySarnoff (Contributor)

From a recent Slack #maths thread:

the number of ulps of precision of each elementary function?
we will not consciously merge a PR that causes a regression in accuracy ...

@oscardssmith (Member)

If this is a consistent performance improvement, it's an OK change from an accuracy perspective. expm1 currently has .76 ULPs of accuracy for Float64, and this introduces a slight regression from .528 to .535 ULPs for expm1_small, but since we are already less accurate over the rest of the domain, that doesn't matter.

@oscardssmith (Member)

That being said, I would much prefer this PR if it modified evalpoly such that it generated these optimal polynomials automatically rather than it being hard coded for this specific use-case.

@jarlebring (Contributor, Author) commented May 19, 2021

That being said, I would much prefer this PR if it modified evalpoly such that it generated these optimal polynomials automatically rather than it being hard coded for this specific use-case.

Me too.

I think the advantage of this approach could be carried over to a Horner scheme if the Horner evaluation is only done for the higher coefficients:

xx  = x*x
xxx = xx*x
p1  = a0 + a1*x + a2*xx
p2  = p1 + xxx*(a3 + x*(a4 + ... + x*(a_{m-1} + a_m*x)...))

Note: the formula can be evaluated with 2 muls and s fmas, in contrast to the current implementation of evalpoly, which I think has s+2 fmas.
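
Concretely, for degree 8 the sketch would look like this (evalpoly_split is a hypothetical helper, just to make the structure explicit; the tail reuses Base's evalpoly for the Horner part):

@inline function evalpoly_split(x, c)  # c = (a0, a1, ..., a8)
    xx  = x * x   # 1 mul
    xxx = xx * x  # 1 mul
    p1 = muladd(c[3], xx, muladd(c[2], x, c[1]))              # a0 + a1*x + a2*x^2
    tail = evalpoly(x, (c[4], c[5], c[6], c[7], c[8], c[9]))  # Horner on a3..a8
    return muladd(xxx, tail, p1)                              # p1 + x^3 * tail
end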

However, if I were to PR evalpoly with such an approach, there would be ULP changes in all functions using evalpoly, and likely one of them would go against the principle of no regression.

@oscardssmith (Member)

This is fine for a PR to evalpoly. The rule for floating-point functions isn't "no loss of accuracy in any case"; it's "where possible, no reduction in accuracy in general". This method is still stable under the same conditions as evalpoly, and evalpoly isn't a function that we guarantee to be ULP-level accurate (especially if the high-order terms are bigger than the low-order terms).

@oscardssmith (Member)

This is really interesting. I've hard-coded the degree 21 version of this, and can confirm it's noticeably faster (it lets me do expm1(::Float16) without a reduction). It would be really nice if we could get this to work automatically for all degrees of evalpoly. This would also be interesting to benchmark for Complex polynomials, where the multiplication is a bit more expensive than addition.

@jarlebring (Contributor, Author) commented May 20, 2021

This is really interesting. I've hard-coded the degree 21 version of this, and can confirm it's noticeably faster (it lets me do expm1(::Float16) without a reduction). It would be really nice if we could get this to work automatically for all degrees of evalpoly. This would also be interesting to benchmark for Complex polynomials, where the multiplication is a bit more expensive than addition.

Okay. Do you mean the PS approach, or the approach where we precompute x^2 and x^3 like I sketched in the comment above?

Here is PS8 in formulas:

xx  = x*x
xxx = xx*x
p1 = c[7] + c[8]*x + c[9]*xx
p2 = c[4] + c[5]*x + c[6]*xx
p3 = c[1] + c[2]*x + c[3]*xx
p  = p3 + xxx*(p2 + xxx*p1)

It is still not entirely clear to me why this is faster. Earlier I thought the advantage over Horner evaluation was that x*x and xx*x require a mul, which can be faster than an fma. Note also that the above is much more parallelization-friendly: once x*x is computed, xxx, p1, p2 and p3 can be computed in parallel, whereas Horner evaluation is completely sequential.

@oscardssmith (Member) commented May 20, 2021

Yeah. The code is:

function _psevalpoly(x, p::Type{NTuple{N,T}}) where {N,T}
        N=1 && return :(p[0])
	s = isqrt(N)
	# generate x^p symbols
	xs = Symbol.("x"*string(i) for i in 1:s)
	exprs = Expr[:(x1=x)]
	for i in 2:s
		push!(exprs, :($(xs[i])=$(xs[i-1])*x))
	end
	ex = :(p[end])
	for i in N-s*s-1:-1:1
		ex = :(muladd(x, $ex, p[$(s*s+i)]))
	end
	for outer in s-1:-1:0
		chain = :(p[$(outer*s+s)])
		for i in s-1:-1:1
			chain = :(muladd($(xs[i]), $chain, p[$(outer*s+i)]))
		end
		ex = :(muladd($(xs[end]), $ex, $chain))
	end
	push!(exprs, ex)
	Expr(:block, exprs...)
end

function myevalpoly(x, p::Tuple)
	if @generated
		return _psevalpoly(x, p)
	else
		evalpoly(x, p)
	end
end

I'm pretty sure the reason this is faster is that it can use more execution ports due to being more parallelizable.

@jarlebring (Contributor, Author) commented May 20, 2021

So what would be needed to get this type of code into evalpoly?

Note that PS is actually a generalization of the Horner scheme if we see s as a free parameter: with s=1, the "outer" part reduces to standard Horner evaluation. I am not aware of any work on the round-off error analysis of PS, e.g., in comparison to Horner, so switching the default behavior of evalpoly to PS with s=isqrt(N) would not be supported by theory. A not-so-intrusive approach would be to generalize evalpoly to PS with a parameter s that is set to s=1 by default. Then expm1 etc. could still use s=isqrt(N) (or something else, depending on what we experimentally determine is most efficient).
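
As a sketch of what such a generalization could look like at runtime (evalpoly_ps is a hypothetical helper, not a proposal for the final Base API; s = 1 falls back to plain Horner via evalpoly):

function evalpoly_ps(x, p::Tuple, s::Int=1)
    N = length(p)
    (s == 1 || N <= s) && return evalpoly(x, p)
    xs = x^s                                    # top power used by the outer Horner
    nblocks = cld(N, s)
    acc = evalpoly(x, p[(nblocks-1)*s+1:end])   # top (possibly short) block
    for k in nblocks-2:-1:0
        block = evalpoly(x, p[k*s+1:(k+1)*s])   # degree s-1 inner block
        acc = muladd(xs, acc, block)            # outer Horner step in x^s
    end
    return acc
end

# With s = isqrt(9) = 3 this reproduces the PS8 grouping above:
evalpoly_ps(0.3, ntuple(i -> 1.0/i, 9), 3) ≈ evalpoly(0.3, ntuple(i -> 1.0/i, 9))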

@oscardssmith (Member)

The error analysis is not too hard here. For polynomials, there are two cases for numerical accuracy. If the sequence p[i]*x^i converges to 0, then almost all of the error is in the first few terms, so this change is on average fine (it slightly decreases accuracy, but evalpoly already isn't especially accurate). If p[i]*x^i diverges, then this should be slightly better, since it will collect the small terms at the start of the sequence earlier. That said, all of this is basically irrelevant, since if you actually need ULP-level accuracy for polynomial evaluation, you need to use compensated arithmetic (like exthorner). The main thing that makes it hard to put this in evalpoly as-is is that I think myevalpoly might technically crash or produce random wrong results, since the generated and non-generated paths do not produce equal results (and we only want to use PS for constant polynomial coefficients).
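
For context, the core of compensated evaluation is to carry the rounding error of each Horner step along exactly. This is the idea behind Base.Math.exthorner; the sketch below is an illustration of one such step, not Base's implementation:

# One compensated Horner step: hi is the running value, lo the running
# error term, c the next coefficient.
function compensated_step(x, hi, lo, c)
    p = hi * x
    e = fma(hi, x, -p)             # exact rounding error of hi*x (TwoProd)
    s = p + c
    v = s - p
    t = (p - (s - v)) + (c - v)    # exact rounding error of p + c (TwoSum)
    return s, muladd(lo, x, e + t) # propagate lo and absorb both errors
end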

@jarlebring (Contributor, Author)

since we only want to use ps for constant polynomial coefficients

Why? I understand that constant coefficients and generated code can lead to further optimization, but why would there be no advantage for arbitrary Float variables?

@oscardssmith (Member)

Good point.

@generated function evalpoly(x, p::NTuple{N,T}) where {N,T}
	N=1 && return :(p[0])
	s = isqrt(N)
	# generate x^p symbols
	xs = Symbol.("x"*string(i) for i in 1:s)
	exprs = Expr[:(x1=x)]
	for i in 2:s
		push!(exprs, :($(xs[i])=$(xs[i-1])*x))
	end
	ex = :(p[end])
	for i in N-s*s-1:-1:1
		ex = :(muladd(x, $ex, p[$(s*s+i)]))
	end
	for outer in s-1:-1:0
		chain = :(p[$(outer*s+s)])
		for i in s-1:-1:1
			chain = :(muladd($(xs[i]), $chain, p[$(outer*s+i)]))
		end
		ex = :(muladd($(xs[end]), $ex, $chain))
	end
	push!(exprs, ex)
	Expr(:block, exprs...)
end

just works as-is. The last thing is to benchmark to see if we want to force s=1 if N<7 or so. I'm guessing that for few coefficients, regular Horner might be a better approach.

@jarlebring (Contributor, Author) commented May 20, 2021

The code doesn't run for me. Shouldn't N=1 && return :(p[0]) be N==1 && return :(p[0])? After changing that, it runs okay, but I have some difficulties benchmarking @generated code...

@oscardssmith (Member)

Oops. It should actually be N==1 && return :(p[1]) (1-indexing strikes again).

@jarlebring (Contributor, Author) commented May 20, 2021

According to my tests, we should switch to PS at degree 8. I had some difficulties preventing the compiler from computing the result at compile time, so I am not completely sure it is correct. Messy code here: https://pastebin.com/yMep0Ywj

  10.694 ns (0 allocations: 0 bytes) # old  deg 4
  12.279 ns (0 allocations: 0 bytes) # new deg 4
  11.741 ns (0 allocations: 0 bytes) # old  deg 5
  11.187 ns (0 allocations: 0 bytes) # new deg 5
  12.433 ns (0 allocations: 0 bytes) # old deg 6
  13.943 ns (0 allocations: 0 bytes) # new deg 6
  13.016 ns (0 allocations: 0 bytes) # old deg 7
  13.949 ns (0 allocations: 0 bytes) # new deg 7
  14.516 ns (0 allocations: 0 bytes) # old deg 8
  13.646 ns (0 allocations: 0 bytes) # new deg 8
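
One standard way to keep the compiler from computing the result at compile time in such microbenchmarks is to interpolate the argument through a Ref (a sketch, using f8_new from the pastebin above):

using BenchmarkTools
x = 0.3
@btime f8_new($(Ref(x))[])   # the Ref indirection defeats constant propagation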

Strange thing: the difference between the computed values for f8 is on the order of 1e-8. Is that really consistent with your roundoff-error reasoning?

julia> f8_new(x)-f8_old(x)
-7.500022321249578e-8

@oscardssmith (Member)

Interesting. The accuracy issues here are bigger than I would have thought. Part of the problem was that I was generating x^n in a dumb way (repeated multiplication rather than powering by squaring). Unfortunately, fixing that has not fixed the problem.

@jarlebring (Contributor, Author)

On second thought, I think there is a bug somewhere in the generated code. With degree 8 we get this generated code:

 :(x1 = x)
 :(x2 = x1 * x)
 :(x3 = x2 * x)
 :(muladd(x3, muladd(x3, muladd(x3, p[end], muladd(x1, muladd(x2, p[9], p[8]), p[7])), muladd(x1, muladd(x2, p[6], p[5]), p[4])), muladd(x1, muladd(x2, p[3], p[2]), p[1])))

It contains both p[end] and p[9], which are the same coefficient, so it is applied twice; it should only appear once.

@oscardssmith (Member) commented May 20, 2021

Good catch. Here's an actually correct version that gives 2e-17 error.

function _evalpoly_new(x, p::Type{NTuple{N,T}}) where {N,T}
	N==1 && return :(p[1])
	s = isqrt(N)
	# generate x^p symbols
	xs = Symbol.("x"*string(i) for i in 1:s)
	exprs = Expr[:(x1=x)]
	for i in 2:s
		halfi = div(i,2)
		if iseven(i)
			push!(exprs, :($(xs[i])=$(xs[halfi])*$(xs[halfi])))
		else
			push!(exprs, :($(xs[i])=$(xs[i-1])*x))
		end
	end
	if s*s<N 
		ex = :(p[end])
		for i in N-s*s-1:-1:1
			ex = :(muladd(x, $ex, p[$(s*s+i)]))
		end
	else
		ex = zero(T)
	end
	for outer in s-1:-1:0
		chain = :(p[$(outer*s+1)])
		for i in 2:s
			chain = :(muladd($(xs[i-1]), p[$(outer*s+i)], $chain))
		end
		ex = :(muladd($(xs[end]), $ex, $chain))
	end
	push!(exprs, ex)
	Expr(:block, exprs...)
end

@generated function evalpoly_new(x, p::NTuple{N,T}) where {N,T}
	return _evalpoly_new(x,p)
end
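
A quick spot-check of the corrected version against Base's evalpoly (arbitrary coefficients; ≈ uses the default relative tolerance):

c = ntuple(i -> randn(), 9)
x = 0.3
@assert evalpoly_new(x, c) ≈ evalpoly(x, c)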

@oscardssmith (Member) commented May 20, 2021

Now the only problem is that it appears this is now slower, for degree 9 at least :( I think it's because I'm adding a multiply by zero that I could get rid of.

@chriselrod (Contributor) commented May 21, 2021

A closer look at the difference between evalpoly_ps8 and evalpoly under code_native:

  • evalpoly: 8 fma operations (vfmadd213sd)

  • evalpoly_ps8: 6 fma operations and 2 vmulsd

So the performance difference seems to stem from the difference in computational cost of vfmadd213sd and vmulsd. The advantage of PS would then come from the fact that PS explicitly builds x^2 and x^3. Can the Horner scheme be rephrased to do the same? Not obviously, at least. If it were possible, the improvement could go into evalpoly.

On many CPUs, vmulsd and vfmadd213sd are equally expensive computationally.
The advantage is out-of-order execution: Horner involves a chain of vfmadd213sd operations, one after the other, while evalpoly_ps8 splits this chain.
Splitting the chain lets the CPU take advantage of out-of-order execution.
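
A rough critical-path count makes this concrete (assuming, purely as an illustration, ~4 cycles of latency for both vmulsd and vfmadd213sd):

# Horner, degree 8: one serial chain of 8 fmas      -> ~8 * 4 = 32 cycles
#
# evalpoly_ps8:
#   xx  = x * x                    # ready after 1 op
#   xxx = xx * x                   # ready after 2 ops
#   X_987, X_654, Y_1              # independent 2-fma trees, each ready after ~2 ops
#   Y = muladd(xxx, muladd(xxx, X_987, X_654), Y_1)  # 2 more dependent ops
#
# Critical path: ~4 dependent ops                    -> ~4 * 4 = 16 cycles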

@jarlebring (Contributor, Author)

It's quite hard for me to read the code. How about separating this into a separate function?

		chain = :(p[$(outer*s+1)])
		for i in 2:s
			chain = :(muladd($(xs[i-1]), p[$(outer*s+i)], $chain))
		end
		ex = :(muladd($(xs[end]), $ex, $chain))

In fact, it is just Horner evaluation with the appropriate pX values as coefficients.

@jarlebring (Contributor, Author)

The advantage is out of order execution. Horner involves a chain of vfmadd213sd, one after the other.
evalpoly_ps8 splits this chain.

Thanks, yes. We figured this out later in the thread. This thread is getting a bit long; the discussion would fit better in a separate evalpoly PR.

@oscardssmith (Member)

Closing in favor of #40919
