
Conversation

@oscardssmith
Member

This is based on the Glibc algorithm, which @chriselrod (Elrond on discourse) described for me. It appears to be about 2x faster than the current algorithm, and equally accurate over the range for which I have tried it. It also should, in theory, be easier to vectorize, since branches are only used to check for over/underflow. The biggest downside is that it uses a table of 1024 values to speed up the calculation.
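
In outline, the reduction that the 1024-entry table enables looks like this (a schematic sketch with the obvious constants, not the exact code in this PR):

x = 1.2345
N = round(Int, x * 1024 / log(2))          # x ≈ N*log(2)/1024 + r
r = x - N * log(2) / 1024                  # |r| ≤ log(2)/2048, so a short polynomial suffices for exp(r)
k, j = N >> 10, N & 1023                   # N = 1024k + j
exp(x) ≈ 2.0^k * 2.0^(j / 1024) * exp(r)   # 2^(j/1024) is what the table stores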

@oscardssmith
Member Author

Doctests seem to be failing due to a floating point precision issue. Any recommendations?

@jakobnissen
Member

You can change the docstring. I'm nearly 100% certain that Julia explicitly does not guarantee specific floating point rounding is preserved across minor versions.

@chriselrod
Contributor

chriselrod commented Jul 22, 2020

As motivation for getting something like this in: gexp is my translation of the GLIBC implementation (which I wrote by looking at the GPL source, meaning its definition must be GPL-ed):

julia> using LoopVectorization

julia> x = rand(256); y = similar(x);

julia> @btime @. $y = exp($x); # base
  1.458 μs (0 allocations: 0 bytes)

julia> @btime @avx @. $y = exp($x);  # ccall (if you're on Linux and SLEEFPirates could find the appropriate GLIBC)
  141.684 ns (0 allocations: 0 bytes)

julia> @btime @avx @. $y = gexp($x); # Julia translation
  83.055 ns (0 allocations: 0 bytes)

I didn't implement the range check, so an actual implementation would probably be a bit slower.

I don't know in how much detail I can describe their algorithm, or their implementation of said algorithm, to someone while still allowing that person to license their own implementation under the MIT license.

@kimikage
Contributor

BTW, in general, it is not fair to use random numbers in [0, 1) for benchmarking exp. Some algorithms rely heavily on branch prediction.

@oscardssmith
Member Author

Do you have a recommendation on how to get a good test set?

@kimikage
Contributor

I have no particular recommendations, but I think randn is simple and slightly fairer than rand.

@chriselrod
Contributor

julia> @btime @. $y = exp($x);
  1.696 μs (0 allocations: 0 bytes)

julia> @btime @avx @. $y = exp($x);
  180.336 ns (0 allocations: 0 bytes)

julia> @btime @avx @. $y = gexp($x);
  81.509 ns (0 allocations: 0 bytes)

julia> @btime @. $y = myexp($x);
  608.153 ns (0 allocations: 0 bytes)

exp is the current version, and myexp is from this PR. It is failing to SIMD, but the scalar performance is impressive.
I (or someone else) can do an accuracy comparison later.

@musm
Contributor

musm commented Jul 22, 2020

I'm against table lookups because they are hard to use with SIMD; both Yeppp [1] and SLEEF avoid them for this important reason.

[1] https://pdfs.semanticscholar.org/f993/776749e3a3e12836a1802b985cd7b524653d.pdf

Four design principles for portable high-throughput vector elementary functions motivated by processor microarchitecture:
- No table lookups
- No division or square root operations
- No branches except for special cases
- Use Horner scheme and Newton-Raphson with software pipelining

Importantly, regarding accuracy: have you tested that this is <= 1 ULP over the FULL domain?

GPL source: can we even use this if you looked at the source? (I'm not an expert on this.)

@chriselrod
Contributor

chriselrod commented Jul 22, 2020

I'm against table lookups because they are hard to use with SIMD, both Yepp [1] and SLEEF avoid them due to this important reason.

Both @avx versions I benchmarked above are SIMD and use table lookups. They're what this PR's version is based on, but it may take more work to make this version SIMD. The autovectorizer does sometimes emit gather instructions, but it is reluctant to do so.
The @avx exp is using ccall to call the actual GLIBC version, which is written directly in x86 assembly.
The @avx gexp was my translation to Julia, using SIMDPirates (wrapping LLVM intrinsics). I'll probably put it in a GPL-ed Julia library in a couple of weeks.
These don't have to rely on the autovectorizer to SIMD the table lookup.

The GLIBC version is much faster than that from SLEEF.

GPL source: Can we even use this if you looked at the source (I'm not an expert on this)

I looked at the source, but @oscardssmith did not.
I'm not an expert on this either. I'd like to know how much I can be involved in this / communicate about the GPL-ed implementation without causing problems.

Importantly regarding accuracy:
Have you tested this is <= 1 ULP over the FULL domain.

I'll test after I get off work (using your code from SLEEF.jl's test suite), unless someone beats me to it.
I'm expecting slightly more ULP than that.

@musm
Contributor

musm commented Jul 22, 2020

Can you post your gexp (or put it in a package or gist...probably the best idea)?

I'm still against using a table look up for these functions even if it might be able to be vectorized in some cases.

A table lookup is probably always going to be faster for a single value. I think the current version will still triumph in overall speed and accuracy considering both single elements and vectorized elements. Note I spent an absurd amount of time exploring various different implementations and versions and the one in base Julia is in my opinion a very very very solid implementation considering many different aspects.

@chriselrod
Contributor

chriselrod commented Jul 22, 2020

Can you post your gexp (or put it in a package or gist...probably the best idea)?

I'll make a package for these in a couple weeks, but here is a gist for now:
https://gist.github.com/chriselrod/d5cbc1dc7f2427cea86080d33a0cf5dc
I placed the GPL license at the top of the file. Beware of clicking the link...

I think the current version will still triumph in overall speed and accuracy considering both single elements and vectorized elements.

For some reason, the performance of gexp seems to have regressed from 80 to 100 ns for 256 doubles on my computer when switching from LLVM 10 to LLVM 9, but it is far ahead in terms of speed when vectorized:

julia> using LoopVectorization, BenchmarkTools, Test

julia> x = 50 .* randn(256); y = similar(x);

julia> @btime @. $y = exp($x); # current version, scalar
  1.468 μs (0 allocations: 0 bytes)

julia> @btime @avx @. $y = exp($x); # GLIBC, SIMD
  147.725 ns (0 allocations: 0 bytes)

julia> @btime @avx @. $y = gexp($x); # GLIBC translated to Julia, but leaving off range checks, SIMD
  100.337 ns (0 allocations: 0 bytes)

julia> @test y == (@avx exp.(x))
Test Passed

julia> @btime @. $y = myexp($x); # This PR, scalar
  608.676 ns (0 allocations: 0 bytes)

In your experience, how much performance tends to be lost to gain a few more ULP in accuracy?

@oscardssmith
Member Author

From my testing, it doesn't appear that this PR introduces extra ULPs of error.

@musm
Contributor

musm commented Jul 22, 2020

julia> @btime exp($(Ref(2.0))[])
  9.109 ns (0 allocations: 0 bytes)


julia> @btime gexp($(Ref(2.0))[])   # this PR
  311.245 ns (5 allocations: 80 bytes)

julia>  x = 10000 .* randn(256); y = similar(x);


julia>  @btime @. $y = exp($x);
  1.080 μs (0 allocations: 0 bytes)

julia>  @btime @. $y = gexp($x);
  12.699 μs (63 allocations: 1008 bytes)

What's going on here?

@oscardssmith
Member Author

For measuring this PR, you want exp not gexp (that is the vectorized version that I haven't yet finished). I get

julia> @btime myexp($(Ref(2.0))[]);
  2.234 ns (0 allocations: 0 bytes)

julia> @btime exp($(Ref(2.0))[]);
  6.141 ns (0 allocations: 0 bytes)

@musm
Contributor

musm commented Jul 22, 2020

gexp is what's posted in this PR (I just copy pasted this PR and renamed it to gexp (bad name I guess))

https://gist.github.com/musm/9671d346eb1e1fccb1bec9b96d5ed87e

@oscardssmith
Member Author

I have absolutely no idea how this would be taking 300ns for scalar or allocating. If you look at the code, there's nothing in there capable of causing an allocation.

@musm
Contributor

musm commented Jul 22, 2020

Looks like the table lookup was missing Float64; that was corrected by a comment that I missed.

@chriselrod
Contributor

chriselrod commented Jul 23, 2020

@oscardssmith will hopefully soon match the performance of the GPL-ed gexp implementation.
But in the meantime, testing the accuracy of the version currently in this PR (source for the accuracy code):

using Base.Math: significand_bits, exponent_bias

@inline exp_kernel(x::Float64) = evalpoly(x, (1.0, 1.0, .5, .1666666666666666666))

const J_TABLE = Float64[2^((j-1)/1024) for j in 1:1024]

function myexp(x::T) where T<:Float64
    # Negation so NaN gets caught
    if !(abs(x) < 708.3964185322641)
        x <= -708.3964185322641 && return 0.0
        x >=  709.782712893384 && return Inf
        isnan(x) && return x
    end
    N = unsafe_trunc(Int, x*1477.3197218702985 + .5)   # 1024/ln2
    r = x - N*0.0006769015435155716                    # ln2/1024
    k = N >> 10

    two_to_k = reinterpret(T, (k+exponent_bias(T)) << significand_bits(T))
    return two_to_k * @inbounds J_TABLE[N&1023 + 1] * exp_kernel(r)
end

using Test, Printf

# the following compares the ulp between x and y.
# First it promotes them to the larger of the two types x,y
const infh(::Type{Float64}) = 1e300
const infh(::Type{Float32}) = 1e37
function countulp(T, x::AbstractFloat, y::AbstractFloat)
    X, Y = promote(x, y)
    x, y = T(X), T(Y) # Cast to smaller type
    (isnan(x) && isnan(y)) && return 0
    (isnan(x) || isnan(y)) && return 10000
    if isinf(x)
        (sign(x) == sign(y) && abs(y) > infh(T)) && return 0 # relaxed infinity handling
        return 10001
    end
    (x ==  Inf && y ==  Inf) && return 0
    (x == -Inf && y == -Inf) && return 0
    if y == 0
        x == 0 && return 0
        return 10002
    end
    if isfinite(x) && isfinite(y)
        return T(abs(X - Y) / ulp(y))
    end
    return 10003
end

const DENORMAL_MIN(::Type{Float64}) = 2.0^-1074 
const DENORMAL_MIN(::Type{Float32}) = 2f0^-149

function ulp(x::T) where {T<:AbstractFloat}
    x = abs(x)
    x == T(0.0) && return DENORMAL_MIN(T)
    val, e = frexp(x)
    return max(ldexp(T(1.0), e - significand_bits(T) - 1), DENORMAL_MIN(T))
end

countulp(x::T, y::T) where {T <: AbstractFloat} = countulp(T, x, y)
strip_module_name(f::Function) = last(split(string(f), '.')) # strip module name from function f

function test_acc(T, fun_table, xx, tol; debug = true, tol_debug = 5)
    @testset "accuracy $(strip_module_name(xfun))" for (xfun, fun) in fun_table
        rmax = 0.0
        rmean = 0.0
        xmax = map(zero, first(xx))
        tol_debug_failed = 0
        for x in xx
            q = xfun(x...)
            c = fun(map(BigFloat, x)...)
            u = countulp(T, q, c)
            rmax = max(rmax, u)
            xmax = rmax == u ? x : xmax
            rmean += u
            if debug && u > tol_debug
                tol_debug_failed += 1
                @printf("%s = %.20g\n%s  = %.20g\nx = %.20g\nulp = %g\n", strip_module_name(xfun), q, strip_module_name(fun), T(c), x, u)
            end
        end
        if debug
            println("Tol debug failed $(100tol_debug_failed / length(xx))% of the time.")
        end
        rmean = rmean / length(xx)

        fmtxloc = isa(xmax, Tuple) ? string('(', join((@sprintf("%.5f", x) for x in xmax), ", "), ')') : @sprintf("%.5f", xmax)
        println(rpad(strip_module_name(xfun), 18, " "), ": max ", @sprintf("%f", rmax),
            rpad(" at x = " * fmtxloc, 40, " "),
            ": mean ", @sprintf("%f", rmean))
       
        t = @test trunc(rmax, digits=1) <= tol

    end
end

xx = range(log(floatmin(Float64)), log(floatmax(Float64)), length = 10^6);
fun_table = Dict(myexp => Base.exp)
tol = 10
@time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)

I get:

Tol debug failed 96.9146% of the time.
myexp             : max 4503599627370620.000000 at x = -708.39642                      : mean 4503599772.142243
accuracy myexp: Test Failed at REPL[45]:29
  Expression: trunc(rmax, digits = 1) <= tol
   Evaluated: 4.50359962737062e15 <= 10

=/
Versus fun_table = Dict(Base.exp => Base.exp):

julia> @time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 0.0% of the time.
exp               : max 0.874235 at x = 197.89612                       : mean 0.263632
Test Summary: | Pass  Total
accuracy exp  |    1      1
  4.071439 seconds (28.83 M allocations: 888.110 MiB, 1.21% gc time)
1-element Vector{Any}:
 Test.DefaultTestSet("accuracy exp", Any[], 1, false)

If we try a much smaller range (EDIT: the range was just -100,100):

Tol debug failed 87.7221% of the time.
myexp             : max 482.007891 at x = -96.34780                       : mean 56.345570
accuracy myexp: Test Failed at REPL[45]:29
  Expression: trunc(rmax, digits = 1) <= tol
   Evaluated: 482.0 <= 10

Versus gexp as I defined/benchmarked in earlier comments:

julia> @time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 0.0% of the time.
gexp              : max 2.946842 at x = -0.74290                        : mean 0.508303
Test Summary: | Pass  Total
accuracy gexp |    1      1
  4.218716 seconds (27.41 M allocations: 870.965 MiB, 4.59% gc time)
1-element Vector{Any}:
 Test.DefaultTestSet("accuracy gexp", Any[], 1, false)

So this PR needs some accuracy work too: max 2.95 instead of 482. Again, gexp follows the same algorithm and is equivalent in the major details, so addressing this accuracy deficit should just require a few tweaks...
Versus the current Base.exp which had a max error of about 0.875 over the larger and the smaller ranges:

julia> @time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 0.0% of the time.
exp               : max 0.873190 at x = -92.51659                       : mean 0.264021
Test Summary: | Pass  Total
accuracy exp  |    1      1
  4.068153 seconds (27.35 M allocations: 866.950 MiB, 1.23% gc time)

Meaning if Julia's standard is < 1 ULP, they both need work.

What about making this Base.FastMath.exp_fast?

@musm
Contributor

musm commented Jul 23, 2020

@chriselrod thanks for looking into this, looks like it's getting close!

Yeah, Base.exp is very accurate, and for the standard library implementation I do think this is quite important. It looks like Base's mean error is about half that of gexp, so we see the trade-off: your code is at least twice as fast.

We are planning on removing @fastmath from base, since it's so problematic and really doesn't need to be a macro.

My opinion is that it would be best to at least place this in a package alongside other glibc math functions so we have some cohesion. Right now, other than one outlier, all the libm code in Julia is based on a single code base, so we have some uniformity; that's why I decided not to pursue replacing them with, say, SLEEF implementations. Otherwise the Base code becomes a bit ad hoc.

In #33288 I did advocate for making Math a standard library so these could more easily be swapped out in user code, but that did not get much traction. If you feel this would also be a good idea, please do comment.

@oscardssmith
Member Author

I have finished the vectorized version. It's a little faster than the glibc version, and about 10x faster than Base for vectorized computation.

using LoopVectorization
using VectorizationBase
using SIMDPirates

@noinline function myexp_fallback(x::SVec{W,Float64}) where {W}
    return SVec(ntuple(Val(W)) do i Core.VecElement(myexp(x[i])) end)
end

@inline function myexp(x::SVec{W,Float64}) where {W}
    # The first magic constant is 1024/ln2. The second is 1.5*2^52.
    # This line uses bit-magic to round and convert.
    # It needs to be inlined to use the muladd
    any(!(abs(x) < 708.3964185322641)) && return myexp_fallback(x)
    N_float = vmuladd(x, 1477.3197218702985, 6755399441055744.0)
    N = reinterpret(SVec{W,Int64}, N_float) & Int64(0xFFFFFFFF)
    N_float = vsub(N_float, 6755399441055744.0)
    
    r = vmuladd(N_float, -0.0006769015435155716 , x)       # -ln2/1024
    js = vload(stridedpointer(J_TABLE), (N&1023,))
    k = N >> 10
    
    small_part = reinterpret(SVec{W, Int64}, vmul(js, exp_kernel(r)))
    return reinterpret(SVec{W, Float64}, vadd(k<<52, small_part))
end
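
For reference, here is a scalar illustration of the 1.5*2^52 "bit-magic" rounding trick mentioned in the code comments above; this is an explanatory sketch (assuming the default round-to-nearest mode), not part of the PR:

MAGIC = 6755399441055744.0              # 1.5 * 2^52
x = 3.7
shifted = x + MAGIC                     # the rounded value of x lands in the low mantissa bits
N = reinterpret(Int64, shifted) & Int64(0xFFFFFFFF)   # == 4 == round(Int, x)
x_rounded = shifted - MAGIC             # == 4.0, i.e. round(x) as a Float64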

@oscardssmith
Member Author

Also note that, if desired, a very similar exp10 implementation could be made with similar performance. It would use one extra term in the Taylor series and an array of 256 values.
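
A rough sketch of what that exp10 variant could look like (hypothetical names, no over/underflow handling, and not tuned to the accuracy Base would require):

const LN10 = log(10.0)
const J_TABLE10 = Float64[2.0^(j / 256) for j in 0:255]
exp10_kernel(r) = evalpoly(r, (1.0, LN10, LN10^2 / 2, LN10^3 / 6, LN10^4 / 24))  # degree-4 kernel for 10^r

function myexp10(x::Float64)
    N = round(Int, x * (256 / log10(2.0)))   # 256/log10(2) ≈ 850.42
    r = x - N * (log10(2.0) / 256)           # |r| ≤ log10(2)/512
    k, j = N >> 8, N & 255                   # N = 256k + j
    return exp2(k) * J_TABLE10[j + 1] * exp10_kernel(r)
end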

Comment on lines 53 to 57
MAX_EXP(::Type{Float64}) = 709.782712893383996732
MAX_EXP(::Type{Float32}) = 88.72283905206835f0 # log 2^127 *(2-2^-23)

# one less than the min exponent since we can sqeeze a bit more from the exp function
MIN_EXP(::Type{Float64}) = -7.451332191019412076235e2 # log 2^-1075
MIN_EXP(::Type{Float64}) = -745.1332191019412076235
Member

Why were the comments removed?

Member Author

Mainly because this PR is in progress. I had taken out those lines entirely, and when I added them back, the comment went missing.

@oscardssmith
Member Author

I actually think this commit has a bug. It makes it about twice as wrong on big and small inputs (< -270 or >270).

@oscardssmith
Member Author

Now for the test result, I get

julia> xx = range(log(floatmin(Float64)), log(floatmax(Float64))-.01, length = 10^6);
julia> test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 4.9347% of the time.
myexp             : max 6.619771 at x = 476.16741                       : mean 0.928930
Test Summary:  | Pass  Total
accuracy myexp |    1      1
1-element Array{Any,1}:
 Test.DefaultTestSet("accuracy myexp", Any[], 1, false)

So the new commit improves things a lot, but it still gets up to 6.6 ulps for high inputs, and is above 3 ulps 5% of the time.

@chriselrod
Contributor

Actually testing gexp over the full range:

julia> xx = range(log(floatmin(Float64)), log(floatmax(Float64)), length = 10^6);

julia> fun_table = Dict(gexp => Base.exp)
Dict{typeof(gexp),typeof(exp)} with 1 entry:
  gexp => exp

julia> @time test_acc(Float64, fun_table, xx, tol, debug = true, tol_debug = 3)
Tol debug failed 0.0% of the time.
gexp              : max 2.990720 at x = 652.23153                       : mean 0.507622
Test Summary: | Pass  Total
accuracy gexp |    1      1
  3.860116 seconds (28.50 M allocations: 869.808 MiB, 1.31% gc time)

So there's still some accuracy left on the table.

@chriselrod
Contributor

chriselrod commented Jul 24, 2020

Yeah, Base.exp is very accurate, and for the standard library implementation I do think this is quite important. It looks like Base's mean error is about half that of gexp, so we see the trade-off: your code is at least twice as fast.

It was over 14 times as fast on my computer! But I agree that accuracy-by-default makes sense for a standard library.

I'm fine with the separate-library approach. By overloading on custom number types (in particular, struct-wrapped NTuple{W,Core.VecElement{T}}) and using tools that make working with these easier (e.g. LoopVectorization), I don't have any problem opting into different implementations, and others can opt in as well.

I'll probably make one with the LGPL2 license and just put direct translations of the GLIBC vector math functions in there. I don't think the license would be a problem for other Julia packages wishing to depend on it, and it's a lot easier to get good accuracy/performance if you can just read their source code.
I don't know how much I can say about implementation differences to help here.
Is there much of a reason to not simply let the package copy the source's open license, and save us from the effort required to keep things MIT?

But while I can get along perfectly fine without it, other people may want an easier route than taking on fairly heavy dependencies (i.e., LoopVectorization to make switching exp implementations easy in loops/broadcasts). I think this point is largely moot until @simd does succeed in vectorizing it, which (as shown by @avx gexp) would dramatically boost performance.

Perhaps @oscardssmith, you could try myexp_fast(x::Float64) without the branch for the domain check,
and set environment variables like the following (I think the docs need updating) to get diagnostics. With a POSIX shell:

export JULIA_LLVM_ARGS="--pass-remarks-analysis=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks=loop-vectorize"

or fish:

set -x JULIA_LLVM_ARGS --pass-remarks-analysis=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks=loop-vectorize

and see if you can find out why @simd doesn't simd it, or better yet, get it to.

@oscardssmith
Member Author

In my newest commit, I get

Tol debug failed 0.0% of the time.
myexp             : max 2.781946 at x = 137.23032                       : mean 0.532910

Turns out Taylor series, while great, aren't the greatest. Thanks to Remez.jl for the easy way to find the coefs!

@chriselrod
Contributor

If you remove the branch and write an @inbounds @simd ivdep loop, myexp will vectorize.
The ivdep is necessary, otherwise we get:

remark: simdloop.jl:75:0: loop not vectorized: cannot identify array bounds
remark: simdloop.jl:75:0: loop not vectorized

Given that we don't get this without myexp, the problem may be with J_TABLE. Turning it into a tuple (where tbaa data should indicate non-aliasing) didn't allow it to vectorize.
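
For concreteness, this is the kind of loop being described (a sketch, assuming the scalar myexp from this PR with the domain-check branch removed; the function name is mine):

function vexp!(y::Vector{Float64}, x::Vector{Float64})
    @inbounds @simd ivdep for i in eachindex(y, x)
        y[i] = myexp(x[i])   # only vectorizes if myexp is inlined and branch-free
    end
    return y
end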

@oscardssmith
Member Author

Note that when vectorized, this implementation is about 10x faster than the current implementation, and in fact is as fast as the LoopVectorization version (Glibc). Should we get Triage to decide whether this slight accuracy loss is acceptable in exchange for a 4x (scalar) or 10x (vectorized) speedup? Given how widely used exp is in high-performance code, I think this performance increase is worth it even if 5% of the time it has >1 ulp of error, and 0.28% of the time >1.5 ulp.

@oscardssmith
Member Author

Can someone explain what the missing GC roots failure is in the analyzegc test? @yuyichao maybe? Thanks!

@vtjnash
Member

vtjnash commented Sep 14, 2020

Probably unrelated? Try pushing a squashed+rebased copy, as the commit list looks a little mixed up too.

@oscardssmith
Member Author

I apparently don't know how to squash this properly. What do I need to do to make that work? I think I have the rebase done correctly though.

@jmert
Contributor

jmert commented Sep 14, 2020

I apparently don't know how to squash this properly. What do I need to do to make that work? I think I have the rebase done correctly though.

I'm seeing 191 commits now, so I think something still went wrong with the rebase.

For something "complicated" like this has gotten, I usually go incrementally toward a fixed-up state.

Make sure your master branch is up-to-date. (Since you're pushing to a fork, make sure your local master is in sync with JuliaLang/julia#master and not oscarssmith/julia#master.) Then back on your patch-2 branch, try git rebase -i master — you should then see the full list of 191 commit SHAs and message summaries in your text editor. Delete all of the lines that are not meant to be part of this PR. Save the file and quit, and git should perform the rebase and remove all of the extra commits. If everything applies cleanly, you should be down to your ~dozen or less actual commits. If it requires intervention to fix conflicts, make the appropriate edits, git commit --amend, and git rebase --continue. If things get confused/mixed up, you can always abort with git rebase --abort and try again.

After that first iteration is done, you can then do the squash. Again, start with git rebase -i master, but this time in the editor, you can change the prefix pick to squash to combine the marked commit with the commit above it in the list. A series of squashes following each other is allowed. Do that to whatever you want to combine, and save and exit. For each series of pick, squash, squash, ..., squash, you'll have the chance to edit the commit message again to clean up/combine the messages, etc. (There's also the option of changing pick to fixup, which is much like squash, but git just ignores the commit message on fixup commits and doesn't ask you about editing the commit message, kinda like doing a git commit --amend retroactively.)

Once that is done, you can check that there's only the number of squashed commits you want with git log --oneline master..patch-2. If everything looks OK at that point, you can then push. If something still looks off (and especially if you think you've "lost" something), don't push since you can restart the process from what's already pushed to Github.

Commit history:
- This is based on the Glibc algorithm which @chriselrod (Elrond on discourse) described for me. It appears to be about 2x faster than the current algorithm, and equally accurate over the range for which I have tried it. It also theoretically should be easier to vectorize as branches are only used for checking for over/underflow.
- Update base/special/exp.jl (Co-authored-by: Jeff Bezanson <jeff.bezanson@gmail.com>)
- Better subnormal numbers, constant usage, and more accurate
- Break r into a hi and lo part to get extra accuracy.
- Fix previous commit.
- Error matches gexp
- Switch to minimax polynomial from taylor polynomial.
- equally fast version with a smaller table.
- Uses a quartic which allows a smaller table. Performance is equal to slightly better, and accuracy is similar.
- fully working?
- fix expm1 for Float32 (calling wrong libm function)
- actually fix expm1
- re-add exp10 docs
- slightly extend upper range for Float32 arguments
- re-add exp doctest, remove redundant exp10 doctest
- maybe now
- placing it after
- use round instead of magic
- re-add magic with better explanation
- remove the typo
- Fix all numbers taking the slow path with range checking.
- Oops
- Fix 32 bit build.
- Use `:ℯ` instead of `float64 ℯ`
- no functional change but less hacky
- Update base/special/exp.jl (Co-authored-by: jmert <2965436+jmert@users.noreply.github.com>)
- Update base/special/exp.jl (Co-authored-by: jmert <2965436+jmert@users.noreply.github.com>)
- Update base/special/exp.jl (Co-authored-by: jmert <2965436+jmert@users.noreply.github.com>)
- Line wrap table

@oscardssmith
Member Author

Merge time?

for (func, base) in (:exp2=>Val(2), :exp=>Val(:ℯ), :exp10=>Val(10))
    @eval begin
        ($func)(x::Real) = ($func)(float(x))
        @inline function ($func)(x::T) where T<:Float64
Contributor

Why are we now adding @inline to each of the exp functions? I'm not sure if this is appropriate, as before we did not enforce inlining of these elementary functions. Is there a good reason for this change?

Member Author

This inlining allows vectorization when combined with @simd ivdep. It often can yield 2x performance improvements.

Member Author

Inlining before would have been useless, as the old versions had way too many branches to benefit from vectorization.

Contributor

I see, makes sense

Contributor

@pkofod commented Sep 15, 2020

Still, this is something to be decided upon. When I inlined sin and cos it caused some compiled code to grow a lot, so I had to undo it in #24117 (unfortunately that PR doesn't link the original issue, which I believe was by @JeffBezanson). I don't know if this applies here as well (I'm not smart enough to know if the table will also be inlined into each function that then calls exp).

Quick question: when we talk about performance improvements here, are they your new inlined version versus the old un-inlined version? Or did you also apply @inline to the existing versions (which would only seem fair, as we could also just inline the current versions)?

Member

I haven't tried it on this branch, but you can use the new inline-cost printer to see what might be holding this back from being automatically inlined. Demo on the current exp:

julia> Base.print_statement_costs(exp, (Float64,))
exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:74
  1 1 ── %1   = Base.bitcast(UInt64, _2)::UInt64
  1%2   = Base.and_int(%1, 0x7fffffffffffffff)::UInt64
  1%3   = Base.bitcast(Base.Int64, _2)::Int64
  1%4   = Base.slt_int(%3, 0)::Bool
  1%5   = Base.ult_int(0x40862e42fefa39ef, %2)::Bool
  0 └───        goto #12 if not %5
  1 2 ── %7   = Base.ule_int(0x7ff0000000000000, %2)::Bool
  0 └───        goto #8 if not %7
  1 3 ── %9   = Base.and_int(%2, 0x000fffffffffffff)::UInt64
  1%10  = Base.sle_int(0, 0)::Bool
  1%11  = Base.bitcast(UInt64, 0)::UInt64
  1%12  = (%9 === %11)::Bool
  1%13  = Base.and_int(%10, %12)::Bool
  0%14  = Base.not_int(%13)::Bool
  0 └───        goto #5 if not %14
  0 4 ──        return NaN
  0 5 ──        goto #7 if not %4
  0 6 ──        return 0.0
  0 7 ──        return Inf
  2 8 ── %20  = Base.lt_float(709.782712893384, _2)::Bool
  0 └───        goto #10 if not %20
  0 9 ──        return Inf
  2 10%23  = Base.lt_float(_2, -745.1332191019412)::Bool
  0 └───        goto #12 if not %23
  0 11return 0.0
  2 12%26  = Base.eq_float(_2, 1.0)::Bool
  0 └───        goto #15 if not %26
  0 13 ─        goto #15 if not true
  0 14return 2.718281828459045
  1 15%30  = Base.ult_int(0x3fd62e42fefa39ef, %2)::Bool
  0 └───        goto #27 if not %30
  1 16%32  = Base.ult_int(%2, 0x3ff0a2b23f3bab73)::Bool
  0 └───        goto #21 if not %32
  0 17 ─        goto #19 if not %4
  1 18%35  = Base.add_float(_2, 0.6931471803691238)::Float64
  0 └───        goto #20
  1 19%37  = Base.sub_float(_2, 0.6931471803691238)::Float64
  0 20%38  = φ (#18 => -1.9082149292705877e-10, #19 => 1.9082149292705877e-10)::Float64
  0%39  = φ (#18 => %35, #19 => %37)::Float64
  0%40  = φ (#18 => -1, #19 => 1)::Int64
  0 └───        goto #22
  4 21%42  = Base.mul_float(1.4426950408889634, _2)::Float64
 10%43  = Base.rint_llvm(%42)::Float64
  1%44  = Base.fptosi(Int64, %43)::Int64
  5%45  = Base.muladd_float(%43, -0.6931471803691238, _2)::Float64
  4 └─── %46  = Base.mul_float(%43, 1.9082149292705877e-10)::Float64
  0 22%47  = φ (#20 => %38, #21 => %46)::Float64
  0%48  = φ (#20 => %39, #21 => %45)::Float64
  0%49  = φ (#20 => %40, #21 => %44)::Int64
  1%50  = Base.sub_float(%48, %47)::Float64
  4%51  = Base.mul_float(%50, %50)::Float64
  5%52  = Base.muladd_float(%51, 4.1381367970572385e-8, -1.6533902205465252e-6)::Float64
  5%53  = Base.muladd_float(%51, %52, 6.613756321437934e-5)::Float64
  5%54  = Base.muladd_float(%51, %53, -0.0027777777777015593)::Float64
  5%55  = Base.muladd_float(%51, %54, 0.16666666666666602)::Float64
  4%56  = Base.mul_float(%51, %55)::Float64
  1%57  = Base.sub_float(%50, %56)::Float64
  4%58  = Base.mul_float(%50, %57)::Float64
  1%59  = Base.sub_float(2.0, %57)::Float64
 20%60  = Base.div_float(%58, %59)::Float64
  1%61  = Base.sub_float(%47, %60)::Float64
  1%62  = Base.sub_float(%61, %48)::Float64
  1%63  = Base.sub_float(1.0, %62)::Float64
  1%64  = Base.slt_int(-52, %49)::Bool
  0 └───        goto #26 if not %64
  1 23%66  = (%49 === 1024)::Bool
  0 └───        goto #25 if not %66
 20 24%68  = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), 0, :(:llvmcall), 2.0, 1023.0, 1023.0, 2.0))::Float64
  4%69  = Base.mul_float(%63, 2.0)::Float64
  4%70  = Base.mul_float(%69, %68)::Float64
  0 └───        return %70
  1 25%72  = Base.add_int(1023, %49)::Int64
  1%73  = Base.bitcast(UInt64, %72)::UInt64
  1%74  = Base.sle_int(0, 52)::Bool
  1%75  = Base.bitcast(UInt64, 52)::UInt64
  1%76  = Base.shl_int(%73, %75)::UInt64
  1%77  = Base.neg_int(52)::Int64
  1%78  = Base.bitcast(UInt64, %77)::UInt64
  1%79  = Base.lshr_int(%73, %78)::UInt64
  1%80  = Base.ifelse(%74, %76, %79)::UInt64
  1%81  = Base.bitcast(Float64, %80)::Float64
  4%82  = Base.mul_float(%63, %81)::Float64
  0 └───        return %82
  1 26%84  = Base.add_int(1023, 52)::Int64
  1%85  = Base.add_int(%84, 1)::Int64
  1%86  = Base.add_int(%85, %49)::Int64
  1%87  = Base.bitcast(UInt64, %86)::UInt64
  1%88  = Base.sle_int(0, 52)::Bool
  1%89  = Base.bitcast(UInt64, 52)::UInt64
  1%90  = Base.shl_int(%87, %89)::UInt64
  1%91  = Base.neg_int(52)::Int64
  1%92  = Base.bitcast(UInt64, %91)::UInt64
  1%93  = Base.lshr_int(%87, %92)::UInt64
  1%94  = Base.ifelse(%88, %90, %93)::UInt64
  1%95  = Base.bitcast(Float64, %94)::Float64
 20%96  = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), 0, :(:llvmcall), 2.0, -53.0, -53.0, 2.0))::Float64
  4%97  = Base.mul_float(%63, %95)::Float64
  4%98  = Base.mul_float(%97, %96)::Float64
  0 └───        return %98
  1 27%100 = Base.ult_int(%2, 0x3e30000000000000)::Bool
  0 └───        goto #29 if not %100
  1 28%102 = Base.add_float(1.0, _2)::Float64
  0 └───        return %102
  4 29%104 = Base.mul_float(_2, _2)::Float64
  5%105 = Base.muladd_float(%104, 4.1381367970572385e-8, -1.6533902205465252e-6)::Float64
  5%106 = Base.muladd_float(%104, %105, 6.613756321437934e-5)::Float64
  5%107 = Base.muladd_float(%104, %106, -0.0027777777777015593)::Float64
  5%108 = Base.muladd_float(%104, %107, 0.16666666666666602)::Float64
  4%109 = Base.mul_float(%104, %108)::Float64
  1%110 = Base.sub_float(_2, %109)::Float64
  4%111 = Base.mul_float(_2, %110)::Float64
  1%112 = Base.sub_float(%110, 2.0)::Float64
 20%113 = Base.div_float(%111, %112)::Float64
  1%114 = Base.sub_float(%113, _2)::Float64
  1%115 = Base.sub_float(1.0, %114)::Float64
  0 └───        return %115

If any of your costs don't seem appropriate then we can make adjustments.

Contributor

That's a really neat feature.

But I think the issue here is that we don't necessarily want it to inline normally, but we do if it's in an @simd ivdep loop, because
(a) In a loop, there are many calls, so the code size cost is amortized by a greater number of calls.
(b) The big one here: inlining lets it SIMD, so the performance difference between inlined vs not-inlined is much higher (> 5x on my computer).

Therefore, the ideal behavior would probably be something along the lines of "inline if it's inside an @simd ivdep loop, don't inline otherwise".

Member

In principle it shouldn't be too hard to adjust the threshold, because of the annotation of the loop with Expr(:loopinfo, Symbol("julia.simdloop"), nothing). It might be a little sensitive to when in the optimizer pipeline that runs, though.

Contributor

I'm not sure it's pretty or preferable, but you could define _exp or inline_exp instead of the current exp, and then define an exp (that doesn't inline), so normal users would not have this happen but advanced users could specifically ask for it. I'm above my pay grade here though :)
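
Something along these lines, perhaps (all names are hypothetical, and Base.exp stands in for the new table-based kernel):

@inline exp_inline(x::Float64) = Base.exp(x)        # always-inlined kernel for @simd ivdep loops
@noinline exp_default(x::Float64) = exp_inline(x)   # what ordinary callers would hit; no code-size blowup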

Member Author

Don't worry, so am I :)

@oscardssmith
Member Author

Bumping this. The remaining test failure is unrelated, and while we could bike-shed forever over whether this should be inlined and similar things, I think the benefits over the current implementation are such that we should prioritize getting this in, in some form, for 1.6.

@StefanKarpinski added the "triage" (This should be discussed on a triage call) label on Oct 1, 2020
@oscardssmith
Member Author

@StefanKarpinski why is this labeled triage? Is there anything that requires decision here?

@musm
Contributor

musm commented Oct 2, 2020

I think @simonbyrne mentioned he has time next week to take a detailed look. That's probably the last step.

Also, on the inlining issue: IMO, we shouldn't inline them by default.

@musm removed the "triage" label on Oct 2, 2020
@jmert mentioned this pull request on Oct 2, 2020
@oscardssmith
Member Author

I'll push a non-inlined version tonight. Anything else the PR needs?

@oscardssmith
Member Author

@simonbyrne any chance you have some free time to review this? If it would help, I'd be happy to set up a call so I could answer questions synchronously.

@oscardssmith
Member Author

@Keno you said that this is ready to merge in your opinion once it gets a CI re-run. Right?

@KristofferC merged commit 0097bdd into JuliaLang:master on Oct 30, 2020
@KristofferC
Member

Oops.. I thought CI was already rerun here :/ My bad. Let's see how it goes.

@stevengj
Member

stevengj commented Nov 3, 2020

I'm getting a loss of accuracy in a SpecialFunctions test due to this change; currently trying to track it down further.

@oscardssmith
Member Author

Hm, that's weird; this should be strictly more accurate. Are you sure it's not related to the sinh/cosh PR? That PR changed the accuracy profile of those functions a bit (mostly for the better, but I'm not sure that all inputs got better).

@stevengj
Member

stevengj commented Nov 3, 2020

I tracked down the problem. The new exp was indeed more accurate, but @august198 had apparently computed arbitrary precision reference results for the specific inputs computed via the old exp, so the slight change in the inputs was messing up our test.

False alarm, sorry!
