A faster version of exp, exp2, exp10 for Floating point numbers #36761
Conversation
|
Doctests seem to be failing due to a floating point precision issue. Any recommendations? |
|
You can change the docstring. I'm nearly 100% certain that Julia explicitly does not guarantee specific floating point rounding is preserved across minor versions. |
|
Motivation to get something like it:
julia> using LoopVectorization
julia> x = rand(256); y = similar(x);
julia> @btime @. $y = exp($x); # base
1.458 μs (0 allocations: 0 bytes)
julia> @btime @avx @. $y = exp($x); # ccall (if you're on Linux and SLEEFPirates could find the appropriate GLIBC)
141.684 ns (0 allocations: 0 bytes)
julia> @btime @avx @. $y = gexp($x); # Julia translation
83.055 ns (0 allocations: 0 bytes)
I didn't implement the range check, so an actual implementation would probably be a bit slower. I don't know in how much detail I can describe their algorithm or their implementation of said algorithm to someone while having that person still license their implementation under the MIT. |
|
BTW, in general, it is not fair to use random numbers in [0, 1) (i.e. plain rand) when benchmarking these functions; that range only exercises a small part of the algorithm. |
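For instance (an illustrative snippet, not from the thread; the accuracy tests later in this conversation use the same idea), inputs covering the whole finite domain of exp can be generated with:
julia> x = collect(range(log(floatmin(Float64)), log(floatmax(Float64)), length = 256));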
|
Do you have a recommendation on how to get a good test set? |
|
I have no particular recommendations, but here is what I see for comparison: |
julia> @btime @. $y = exp($x);
1.696 μs (0 allocations: 0 bytes)
julia> @btime @avx @. $y = exp($x);
180.336 ns (0 allocations: 0 bytes)
julia> @btime @avx @. $y = gexp($x);
81.509 ns (0 allocations: 0 bytes)
julia> @btime @. $y = myexp($x);
608.153 ns (0 allocations: 0 bytes)
|
|
I'm against table lookups because they are hard to use with SIMD; both Yepp [1] and SLEEF avoid them for this reason. [1] https://pdfs.semanticscholar.org/f993/776749e3a3e12836a1802b985cd7b524653d.pdf
Importantly, regarding accuracy and the GPL source: can we even use this if you looked at the source? (I'm not an expert on this.) |
The GLIBC version is much faster than that from SLEEF.
I looked at the source, but @oscardssmith did not.
I'll test after I get off work (using your code from SLEEF.jl's test suite), unless someone beats me to it. |
|
Can you post your code? |
|
I'm still against using a table lookup for these functions, even if it might be able to be vectorized in some cases. A table lookup is probably always going to be faster for a single value. I think the current version will still triumph in overall speed and accuracy considering both single elements and vectorized elements. Note I spent an absurd amount of time exploring various different implementations and versions, and the one in base Julia is in my opinion a very, very, very solid implementation considering many different aspects. |
I'll make a package for these in a couple weeks, but here is a gist for now:
For some reason the performance numbers differ a bit from my earlier run:
julia> using LoopVectorization, BenchmarkTools, Test
julia> x = 50 .* randn(256); y = similar(x);
julia> @btime @. $y = exp($x); # current version, scalar
1.468 μs (0 allocations: 0 bytes)
julia> @btime @avx @. $y = exp($x); # GLIBC, SIMD
147.725 ns (0 allocations: 0 bytes)
julia> @btime @avx @. $y = gexp($x); # GLIBC translated to Julia, but leaving off range checks, SIMD
100.337 ns (0 allocations: 0 bytes)
julia> @test y == (@avx exp.(x))
Test Passed
julia> @btime @. $y = myexp($x); # This PR, scalar
608.676 ns (0 allocations: 0 bytes)
In your experience, how much performance tends to be lost to gain a few extra ULPs of accuracy? |
|
From my testing, it doesn't appear that this PR introduces extra ULPs of error. |
julia> @btime exp($(Ref(2.0))[])
9.109 ns (0 allocations: 0 bytes)
julia> @btime gexp($(Ref(2.0))[]) # this PR
311.245 ns (5 allocations: 80 bytes)
julia> x = 10000 .* randn(256); y = similar(x);
julia> @btime @. $y = exp($x);
1.080 μs (0 allocations: 0 bytes)
julia> @btime @. $y = gexp($x);
12.699 μs (63 allocations: 1008 bytes)
What's going on here? |
|
For measuring this PR, you want myexp rather than gexp (the GLIBC translation). |
|
https://gist.github.com/musm/9671d346eb1e1fccb1bec9b96d5ed87e |
|
I have absolutely no idea how this would be taking 300ns for scalar or allocating. If you look at the code, there's nothing in there capable of causing an allocation. |
|
Looks like the table lookup was missing. |
|
@oscardssmith will hopefully soon match the performance of the GPL-ed versions.
using Base.Math: significand_bits, exponent_bias
@inline exp_kernel(x::Float64) = evalpoly(x, (1.0, 1.0, .5, .1666666666666666666))
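# exp_kernel is the degree-3 Taylor polynomial of exp; J_TABLE[j] == 2^((j-1)/1024), a 1024-entry table for the second level of the reduction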
const J_TABLE = Float64[2^((j-1)/1024) for j in 1:1024]
function myexp(x::T) where T<:Float64
# Negation so NaN gets caught
if !(abs(x) < 708.3964185322641)
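# 708.3964185322641 == log(2^1022): beyond this magnitude the result overflows, falls below the normal range, or is NaN, so screen those cases here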
x <= -708.3964185322641 && return 0.0
x >= 709.782712893384 && return Inf
isnan(x) && return x
end
N = unsafe_trunc(Int, x*1477.3197218702985 + .5) # 1024/ln2
r = x - N*0.0006769015435155716 # ln2/1024
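# exp(x) == 2^(N/1024) * exp(r) == 2^k * 2^(j/1024) * exp(r), with k = N >> 10 and j = N & 1023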
k = N >> 10
two_to_k = reinterpret(T, (k+exponent_bias(T)) << significand_bits(T))
return two_to_k * @inbounds J_TABLE[N&1023 + 1] * exp_kernel(r)
end
using Test, Printf
# the following compares the ulp between x and y.
# First it promotes them to the larger of the two types x,y
infh(::Type{Float64}) = 1e300
infh(::Type{Float32}) = 1e37
function countulp(T, x::AbstractFloat, y::AbstractFloat)
X, Y = promote(x, y)
x, y = T(X), T(Y) # Cast to smaller type
(isnan(x) && isnan(y)) && return 0
(isnan(x) || isnan(y)) && return 10000
if isinf(x)
(sign(x) == sign(y) && abs(y) > infh(T)) && return 0 # relaxed infinity handling
return 10001
end
(x == Inf && y == Inf) && return 0
(x == -Inf && y == -Inf) && return 0
if y == 0
x == 0 && return 0
return 10002
end
if isfinite(x) && isfinite(y)
return T(abs(X - Y) / ulp(y))
end
return 10003
end
DENORMAL_MIN(::Type{Float64}) = 2.0^-1074
DENORMAL_MIN(::Type{Float32}) = 2f0^-149
function ulp(x::T) where {T<:AbstractFloat}
x = abs(x)
x == T(0.0) && return DENORMAL_MIN(T)
val, e = frexp(x)
return max(ldexp(T(1.0), e - significand_bits(T) - 1), DENORMAL_MIN(T))
end
countulp(x::T, y::T) where {T <: AbstractFloat} = countulp(T, x, y)
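# Example usage: countulp(Float64, myexp(1.0), exp(big(1.0))) gives myexp's error at x = 1.0 in ULPs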
strip_module_name(f::Function) = last(split(string(f), '.')) # strip module name from function f
function test_acc(T, fun_table, xx, tol; debug = true, tol_debug = 5)
@testset "accuracy $(strip_module_name(xfun))" for (xfun, fun) in fun_table
rmax = 0.0
rmean = 0.0
xmax = map(zero, first(xx))
tol_debug_failed = 0
for x in xx
q = xfun(x...)
c = fun(map(BigFloat, x)...)
u = countulp(T, q, c)
rmax = max(rmax, u)
xmax = rmax == u ? x : xmax
rmean += u
if debug && u > tol_debug
tol_debug_failed += 1
@printf("%s = %.20g\n%s = %.20g\nx = %.20g\nulp = %g\n", strip_module_name(xfun), q, strip_module_name(fun), T(c), x, u)
end
end
if debug
println("Tol debug failed $(100tol_debug_failed / length(xx))% of the time.")
end
rmean = rmean / length(xx)
fmtxloc = isa(xmax, Tuple) ? string('(', join((@sprintf("%.5f", x) for x in xmax), ", "), ')') : @sprintf("%.5f", xmax)
println(rpad(strip_module_name(xfun), 18, " "), ": max ", @sprintf("%f", rmax),
rpad(" at x = " * fmtxloc, 40, " "),
": mean ", @sprintf("%f", rmean))
t = @test trunc(rmax, digits=1) <= tol
end
end
xx = range(log(floatmin(Float64)), log(floatmax(Float64)), length = 10^6);
fun_table = Dict(myexp => Base.exp)
tol = 10
@time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
I get:
Tol debug failed 96.9146% of the time.
myexp : max 4503599627370620.000000 at x = -708.39642 : mean 4503599772.142243
accuracy myexp: Test Failed at REPL[45]:29
Expression: trunc(rmax, digits = 1) <= tol
Evaluated: 4.50359962737062e15 <= 10 =/
For comparison, Base.exp over the same range:
julia> @time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 0.0% of the time.
exp : max 0.874235 at x = 197.89612 : mean 0.263632
Test Summary: | Pass Total
accuracy exp | 1 1
4.071439 seconds (28.83 M allocations: 888.110 MiB, 1.21% gc time)
1-element Vector{Any}:
Test.DefaultTestSet("accuracy exp", Any[], 1, false)
If we try a much smaller range (EDIT: the range was just -100, 100):
Tol debug failed 87.7221% of the time.
myexp : max 482.007891 at x = -96.34780 : mean 56.345570
accuracy myexp: Test Failed at REPL[45]:29
Expression: trunc(rmax, digits = 1) <= tol
Evaluated: 482.0 <= 10
Versus gexp:
julia> @time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 0.0% of the time.
gexp : max 2.946842 at x = -0.74290 : mean 0.508303
Test Summary: | Pass Total
accuracy gexp | 1 1
4.218716 seconds (27.41 M allocations: 870.965 MiB, 4.59% gc time)
1-element Vector{Any}:
Test.DefaultTestSet("accuracy gexp", Any[], 1, false)
So this PR needs some accuracy work too: max 2.95 instead of 482. Again, Base.exp for reference:
julia> @time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 0.0% of the time.
exp : max 0.873190 at x = -92.51659 : mean 0.264021
Test Summary: | Pass Total
accuracy exp | 1 1
4.068153 seconds (27.35 M allocations: 866.950 MiB, 1.23% gc time)
Meaning if Julia's standard is < 1 ULP, they both need work. What about making this the @fastmath version? |
|
@chriselrod thanks for looking into this, looks like it's getting close!

Yeah, Base.exp is very accurate, and for the standard library implementation I do think this is quite important. It looks like the mean error is twice as accurate comparing Base to gexp. We are planning on removing

My opinion is that it would be best to at least place this in a package alongside other glibc math functions so we can have some cohesion. Right now, other than one outlier, all the libm code in Julia is based on a single code base, so we have some uniformity. It's why I decided not to pursue replacing them with, say, SLEEF implementations; the Base code then becomes a bit ad hoc. I did advocate for making |
|
I have finished the vectorized version. It's a little faster than the glibc version, and about 10x faster than base for vectorized computation |
|
Also note that, if desired, a very similar exp10 implementation could be made that would be of similar performance. It would use one extra term in the Taylor series, and an array of 256 values. |
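As a rough illustration of that idea (my own sketch, not code from this PR: the names, constants, and kernel are assumptions, and range/overflow/subnormal checks are omitted):
using Base.Math: significand_bits, exponent_bias
const LN10 = log(10.0)
const EXP10_TABLE = Float64[2.0^(j/256) for j in 0:255]   # 2^(j/256): the 256-entry table
function exp10_sketch(x::Float64)
    # 10^x == 2^(N/256) * 10^r, where N = round(x*256/log10(2)) and r = x - N*log10(2)/256
    N = round(Int, x * (256/log10(2)))
    r = x - N * (log10(2)/256)            # |r| <= log10(2)/512
    k = N >> 8
    j = N & 255
    two_to_k = reinterpret(Float64, (k + exponent_bias(Float64)) << significand_bits(Float64))
    # 10^r == exp(r*log(10)); a degree-4 Taylor kernel, one more term than the exp kernel above
    return two_to_k * EXP10_TABLE[j + 1] * evalpoly(r, (1.0, LN10, LN10^2/2, LN10^3/6, LN10^4/24))
end
To be competitive in accuracy it would presumably need the same hi/lo splitting and minimax polishing discussed elsewhere in this thread.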
base/special/exp.jl
Outdated
MAX_EXP(::Type{Float64}) = 709.782712893383996732
MAX_EXP(::Type{Float32}) = 88.72283905206835f0 # log 2^127 *(2-2^-23)

# one less than the min exponent since we can sqeeze a bit more from the exp function
MIN_EXP(::Type{Float64}) = -7.451332191019412076235e2 # log 2^-1075
MIN_EXP(::Type{Float64}) = -745.1332191019412076235
Why were the comments removed?
Mainly because this PR is in progress. I had taken out the entire lines, and when I added them back, the comment was missing.
|
I actually think this commit has a bug. It makes it about twice as wrong on big and small inputs (< -270 or >270). |
|
Now for the test results: the new commit improves a lot, but it still gets up to 6.6 ULPs for high inputs, and is above 3 ULPs 5% of the time. |
|
Actually, testing gexp over the full range:
julia> xx = range(log(floatmin(Float64)), log(floatmax(Float64)), length = 10^6);
julia> fun_table = Dict(gexp => Base.exp)
Dict{typeof(gexp),typeof(exp)} with 1 entry:
gexp => exp
julia> @time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 0.0% of the time.
gexp : max 2.990720 at x = 652.23153 : mean 0.507622
Test Summary: | Pass Total
accuracy gexp | 1 1
3.860116 seconds (28.50 M allocations: 869.808 MiB, 1.31% gc time)
So there's still some accuracy left on the table. |
It was over 14 times as fast on my computer! But I agree that accuracy-by-default makes sense for a standard library. I'm fine with the separate library approach, by overloading on custom number types (in particular, struct-wrapped SIMD vectors).

I'll probably make one with the LGPL2 license and just put direct translations of the GLIBC vector math functions in there. I don't think the license would be a problem for other Julia packages wishing to depend on it, and it's a lot easier to get good accuracy/performance if you can just read their source code. But while I can get along perfectly fine without it, other people may want an easier route than taking on fairly heavy dependencies (i.e., LoopVectorization to make switching implementations easy).

Perhaps @oscardssmith, you could try
export JULIA_LLVM_ARGS="--pass-remarks-analysis=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks=loop-vectorize"
or, for fish:
set -x JULIA_LLVM_ARGS --pass-remarks-analysis=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks=loop-vectorize
and see if you can find out why it isn't vectorizing. |
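For reference, this is the kind of minimal loop one could compile to provoke those remarks (an illustrative snippet; vexp! is a placeholder name, not something from this PR):
function vexp!(y, x)
    @inbounds @simd ivdep for i in eachindex(x, y)
        y[i] = exp(x[i])
    end
    return y
end
With the environment variable above set before starting Julia, compiling a call like vexp!(y, x) should print loop-vectorize remarks saying whether, and if not why, the @simd loop vectorized.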
|
In my newest commit I get better numbers. Turns out Taylor series, while great, aren't the greatest. Thanks to Remez.jl for the easy way to find the coefs! |
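For anyone curious, this is roughly the kind of call involved (a sketch based on Remez.jl's ratfn_minimax interface; the interval and degrees here are illustrative, not the ones used in this PR):
using Remez
# Minimax polynomial of degree 4 (denominator degree 0) for exp on a small reduced-argument
# interval; N and D hold the numerator/denominator coefficients and E the maximum error.
N, D, E, X = ratfn_minimax(exp, (-0.0004, 0.0004), 4, 0)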
|
If you remove the branch and write an @simd loop over it, you get
remark: simdloop.jl:75:0: loop not vectorized: cannot identify array bounds
remark: simdloop.jl:75:0: loop not vectorized
Given that we don't get this without the table lookup, the lookup appears to be what is blocking vectorization. |
|
Note that when vectorized, this implementation is about 10x faster than the current implementation, and in fact is as fast as the LoopVectorization version (GLIBC). Should we get Triage to make a decision on whether this slight accuracy loss is acceptable for a 4x faster (scalar) or 10x faster (vectorized) exp? Given how widely used exp is, this seems worth it. |
|
Can someone explain what the missing GC roots failure is in the analyzegc test? @yuyichao maybe? Thanks! |
|
Probably unrelated? Try pushing a squashed+rebased copy, as the commit list looks a little mixed up too. |
|
I apparently don't know how to squash this properly. What do I need to do to make that work? I think I have the rebase done correctly though. |
I'm seeing 191 commits now, so I think something still went wrong with the rebase. For something "complicated" like this has gotten, I usually go incrementally toward a fixed-up state. Make sure your branch first rebases cleanly onto master. After that first iteration is done, you can then do the squash. Again, start with an interactive rebase. Once that is done, you can check that there's only the number of squashed commits you want with the git log. |
This is based on the Glibc algorithm, which @chriselrod (Elrond on discourse) described for me. It appears to be about 2x faster than the current algorithm, and equally accurate over the range for which I have tried it. It also theoretically should be easier to vectorize, as branches are only used for checking for over/underflow.

Update base/special/exp.jl Co-authored-by: Jeff Bezanson <jeff.bezanson@gmail.com>
Better subnormal numbers, constant usage, and more accurate
Break r into a hi and lo part to get extra accuracy.
Fix previous comit. Error matches gexp
Switch to minimax polynomial from taylor polynomial.
equally fast version with a smaller table. Uses a quartic which allows a smaller table. Performance is equal to slightly better, and accuracy is similar.
fully working?
fix expm1 for Float32 (calling wrong libm function)
actually fix expm1
re-add exp10 docs
slightly extend upper range for Float32 arguments
re-add exp doctest, remove redundant exp10 doctest
maybe now placing it after
use round instead of magic
re-add magic with better explanation
remove the typo
Fix all numbers taking the slow path with range checking. Oops
Fix 32 bit build.
Use `:ℯ` instead of `float64 ℯ` no functional change but less hacky
Update base/special/exp.jl Co-authored-by: jmert <2965436+jmert@users.noreply.github.com>
Update base/special/exp.jl Co-authored-by: jmert <2965436+jmert@users.noreply.github.com>
Update base/special/exp.jl Co-authored-by: jmert <2965436+jmert@users.noreply.github.com>
Line wrap table
force-pushed from 30d43d9 to bc96a40
|
Merge time? |
base/special/exp.jl
Outdated
for (func, base) in (:exp2=>Val(2), :exp=>Val(:ℯ), :exp10=>Val(10))
    @eval begin
        ($func)(x::Real) = ($func)(float(x))
        @inline function ($func)(x::T) where T<:Float64
Why are we now adding @inline to each of the exp functions? I'm not sure if this is appropriate, as before we did not enforce inlining of these elementary functions. Is there a good reason for this change?
This inlining allows vectorization when combined with @simd ivdep. It often can yield 2x performance improvements.
Adding @inline before this change would have been useless, as the old versions had way too many branches to benefit from vectorization.
I see, makes sense
Still this is something to be decided upon. When I inlined sin and cos it caused some compiled code to grow a lot, so I had to undo it #24117 (unfortunately that PR doesn't link the original issue which I believe was by @JeffBezanson ). I don't know if this applies here as well (I'm not smart enough to know if the table will also be inlined into each function that then calls exp).
Quick question: when we talk about performance improvements here, are they then your new inlined versions versus the old un-inlined (?) versions? Or did you then also apply inline to the existing versions (which would only seem fair as we could also just inline the current versions).
I haven't tried it on this branch, but you can use the new inline-cost printer to see what might be holding this back from being automatically inlined. Demo on the current exp:
julia> Base.print_statement_costs(exp, (Float64,))
exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:74
1 1 ── %1 = Base.bitcast(UInt64, _2)::UInt64
1 │ %2 = Base.and_int(%1, 0x7fffffffffffffff)::UInt64
1 │ %3 = Base.bitcast(Base.Int64, _2)::Int64
1 │ %4 = Base.slt_int(%3, 0)::Bool
1 │ %5 = Base.ult_int(0x40862e42fefa39ef, %2)::Bool
0 └─── goto #12 if not %5
1 2 ── %7 = Base.ule_int(0x7ff0000000000000, %2)::Bool
0 └─── goto #8 if not %7
1 3 ── %9 = Base.and_int(%2, 0x000fffffffffffff)::UInt64
1 │ %10 = Base.sle_int(0, 0)::Bool
1 │ %11 = Base.bitcast(UInt64, 0)::UInt64
1 │ %12 = (%9 === %11)::Bool
1 │ %13 = Base.and_int(%10, %12)::Bool
0 │ %14 = Base.not_int(%13)::Bool
0 └─── goto #5 if not %14
0 4 ── return NaN
0 5 ── goto #7 if not %4
0 6 ── return 0.0
0 7 ── return Inf
2 8 ── %20 = Base.lt_float(709.782712893384, _2)::Bool
0 └─── goto #10 if not %20
0 9 ── return Inf
2 10 ─ %23 = Base.lt_float(_2, -745.1332191019412)::Bool
0 └─── goto #12 if not %23
0 11 ─ return 0.0
2 12 ┄ %26 = Base.eq_float(_2, 1.0)::Bool
0 └─── goto #15 if not %26
0 13 ─ goto #15 if not true
0 14 ─ return 2.718281828459045
1 15 ┄ %30 = Base.ult_int(0x3fd62e42fefa39ef, %2)::Bool
0 └─── goto #27 if not %30
1 16 ─ %32 = Base.ult_int(%2, 0x3ff0a2b23f3bab73)::Bool
0 └─── goto #21 if not %32
0 17 ─ goto #19 if not %4
1 18 ─ %35 = Base.add_float(_2, 0.6931471803691238)::Float64
0 └─── goto #20
1 19 ─ %37 = Base.sub_float(_2, 0.6931471803691238)::Float64
0 20 ┄ %38 = φ (#18 => -1.9082149292705877e-10, #19 => 1.9082149292705877e-10)::Float64
0 │ %39 = φ (#18 => %35, #19 => %37)::Float64
0 │ %40 = φ (#18 => -1, #19 => 1)::Int64
0 └─── goto #22
4 21 ─ %42 = Base.mul_float(1.4426950408889634, _2)::Float64
10 │ %43 = Base.rint_llvm(%42)::Float64
1 │ %44 = Base.fptosi(Int64, %43)::Int64
5 │ %45 = Base.muladd_float(%43, -0.6931471803691238, _2)::Float64
4 └─── %46 = Base.mul_float(%43, 1.9082149292705877e-10)::Float64
0 22 ┄ %47 = φ (#20 => %38, #21 => %46)::Float64
0 │ %48 = φ (#20 => %39, #21 => %45)::Float64
0 │ %49 = φ (#20 => %40, #21 => %44)::Int64
1 │ %50 = Base.sub_float(%48, %47)::Float64
4 │ %51 = Base.mul_float(%50, %50)::Float64
5 │ %52 = Base.muladd_float(%51, 4.1381367970572385e-8, -1.6533902205465252e-6)::Float64
5 │ %53 = Base.muladd_float(%51, %52, 6.613756321437934e-5)::Float64
5 │ %54 = Base.muladd_float(%51, %53, -0.0027777777777015593)::Float64
5 │ %55 = Base.muladd_float(%51, %54, 0.16666666666666602)::Float64
4 │ %56 = Base.mul_float(%51, %55)::Float64
1 │ %57 = Base.sub_float(%50, %56)::Float64
4 │ %58 = Base.mul_float(%50, %57)::Float64
1 │ %59 = Base.sub_float(2.0, %57)::Float64
20 │ %60 = Base.div_float(%58, %59)::Float64
1 │ %61 = Base.sub_float(%47, %60)::Float64
1 │ %62 = Base.sub_float(%61, %48)::Float64
1 │ %63 = Base.sub_float(1.0, %62)::Float64
1 │ %64 = Base.slt_int(-52, %49)::Bool
0 └─── goto #26 if not %64
1 23 ─ %66 = (%49 === 1024)::Bool
0 └─── goto #25 if not %66
20 24 ─ %68 = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), 0, :(:llvmcall), 2.0, 1023.0, 1023.0, 2.0))::Float64
4 │ %69 = Base.mul_float(%63, 2.0)::Float64
4 │ %70 = Base.mul_float(%69, %68)::Float64
0 └─── return %70
1 25 ─ %72 = Base.add_int(1023, %49)::Int64
1 │ %73 = Base.bitcast(UInt64, %72)::UInt64
1 │ %74 = Base.sle_int(0, 52)::Bool
1 │ %75 = Base.bitcast(UInt64, 52)::UInt64
1 │ %76 = Base.shl_int(%73, %75)::UInt64
1 │ %77 = Base.neg_int(52)::Int64
1 │ %78 = Base.bitcast(UInt64, %77)::UInt64
1 │ %79 = Base.lshr_int(%73, %78)::UInt64
1 │ %80 = Base.ifelse(%74, %76, %79)::UInt64
1 │ %81 = Base.bitcast(Float64, %80)::Float64
4 │ %82 = Base.mul_float(%63, %81)::Float64
0 └─── return %82
1 26 ─ %84 = Base.add_int(1023, 52)::Int64
1 │ %85 = Base.add_int(%84, 1)::Int64
1 │ %86 = Base.add_int(%85, %49)::Int64
1 │ %87 = Base.bitcast(UInt64, %86)::UInt64
1 │ %88 = Base.sle_int(0, 52)::Bool
1 │ %89 = Base.bitcast(UInt64, 52)::UInt64
1 │ %90 = Base.shl_int(%87, %89)::UInt64
1 │ %91 = Base.neg_int(52)::Int64
1 │ %92 = Base.bitcast(UInt64, %91)::UInt64
1 │ %93 = Base.lshr_int(%87, %92)::UInt64
1 │ %94 = Base.ifelse(%88, %90, %93)::UInt64
1 │ %95 = Base.bitcast(Float64, %94)::Float64
20 │ %96 = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), 0, :(:llvmcall), 2.0, -53.0, -53.0, 2.0))::Float64
4 │ %97 = Base.mul_float(%63, %95)::Float64
4 │ %98 = Base.mul_float(%97, %96)::Float64
0 └─── return %98
1 27 ─ %100 = Base.ult_int(%2, 0x3e30000000000000)::Bool
0 └─── goto #29 if not %100
1 28 ─ %102 = Base.add_float(1.0, _2)::Float64
0 └─── return %102
4 29 ─ %104 = Base.mul_float(_2, _2)::Float64
5 │ %105 = Base.muladd_float(%104, 4.1381367970572385e-8, -1.6533902205465252e-6)::Float64
5 │ %106 = Base.muladd_float(%104, %105, 6.613756321437934e-5)::Float64
5 │ %107 = Base.muladd_float(%104, %106, -0.0027777777777015593)::Float64
5 │ %108 = Base.muladd_float(%104, %107, 0.16666666666666602)::Float64
4 │ %109 = Base.mul_float(%104, %108)::Float64
1 │ %110 = Base.sub_float(_2, %109)::Float64
4 │ %111 = Base.mul_float(_2, %110)::Float64
1 │ %112 = Base.sub_float(%110, 2.0)::Float64
20 │ %113 = Base.div_float(%111, %112)::Float64
1 │ %114 = Base.sub_float(%113, _2)::Float64
1 │ %115 = Base.sub_float(1.0, %114)::Float64
0 └─── return %115
If any of your costs don't seem appropriate then we can make adjustments.
That's a really neat feature.
But I think the issue here is that we don't necessarily want it to inline normally, but we do if it's in an @simd ivdep loop, because
(a) In a loop, there are many calls, so the code size cost is amortized by a greater number of calls.
(b) The big one here: inlining lets it SIMD, so the performance difference between inlined vs not-inlined is much higher (> 5x on my computer).
Therefore, the ideal behavior would probably be something along the lines of "inline if it's inside an @simd ivdep loop, don't inline otherwise".
In principle it shouldn't be too hard to adjust the threshold, because of the annotation of the loop with Expr(:loopinfo, Symbol("julia.simdloop"), nothing). It might be a little sensitive to when in the optimizer pipeline that runs, though.
I'm not sure it's pretty or preferable, but you could define _exp or inline_exp instead of the current exp, and then define an exp (that doesn't inline) so normal users would not have this happen, while advanced users could specifically ask for it. I'm above my paygrade here though :)
Don't worry, so am I :)
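A tiny sketch of the _exp / inline_exp idea above (illustrative names, not from this PR):
@inline _exp(x::Float64) = exp(x)                 # stand-in for the always-inlineable implementation
@noinline exp_noinline(x::Float64) = _exp(x)      # public entry point that is not force-inlined; SIMD loops would call _exp directly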
|
Bumping this. The remaining test failure is unrelated, and while we could bike-shed whether this should be inlined and similar things forever, I think the benefits over the current implementation are such that we should prioritize getting this in, in some form, for 1.6. |
|
@StefanKarpinski why is this labeled triage? Is there anything that requires decision here? |
|
I think @simonbyrne mentioned he has time next week to take a detailed look. That's probably the last step. Also, on the inlining issue: IMO, I don't think we should inline them by default. |
|
I'll push a non-inlined version tonight. Anything the PR needs? |
|
@simonbyrne any chance you have some free time to review this? If it would help, I'd be happy to set up a call so I could answer questions synchronously. |
|
@Keno you said that this is ready to merge in your opinion once it gets a CI re-run. Right? |
|
Oops.. I thought CI was already rerun here :/ My bad. Let's see how it goes. |
|
I'm getting a loss of accuracy in a SpecialFunctions test due to this change; currently trying to track it down further. |
|
Hm, that's weird; this should be strictly more accurate. Are you sure it's not sinh/cosh related? That PR changed the accuracy profile of the functions a bit (mostly for the better, but I'm not sure that all inputs got better). |
|
I tracked down the problem; it wasn't caused by the new exp. False alarm, sorry! |
This is based on the Glibc algorithm, which @chriselrod (Elrond on discourse) described for me. It appears to be about 2x faster than the current algorithm, and equally accurate over the range for which I have tried it. It also theoretically should be easier to vectorize, as branches are only used for checking for over/underflow. The biggest downside is that it uses a table of 1024 values to speed up the calculation.