Add pure julia exp function #19831

Merged
Merged 2 commits into JuliaLang:master from musm:expfun on Jan 9, 2017


8 participants

@musm
Contributor
musm commented Jan 3, 2017 edited

Need to add benchmarks, but it is as fast as or faster than the openlibm version: between 1.0x (same speed) and roughly 1.15x, 1.3x, or 1.5x faster depending on which branch you test (measured with BenchmarkTools). On most branches it is faster; I think there is only one branch where it is basically the same speed.

Accuracy is basically the same (both are based on the msun version)
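
For reference, a minimal sketch of the kind of BenchmarkTools comparison described above (the ccall into libopenlibm is illustrative and not part of the PR; absolute timings depend on the machine and branch):

using BenchmarkTools
x = 1.234
@benchmark exp($x)  # the pure-Julia version once this PR is loaded
@benchmark ccall((:exp, "libopenlibm"), Float64, (Float64,), $x)  # the current openlibm path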

base/special/exp.jl
+function exp{T<:Union{Float32,Float64}}(x::T)
+ xu = reinterpret(Unsigned, x)
+ xs = xu & ~sign_mask(T)
+ xsb = Int(xu >> Unsigned(sizeof(T)*8 - 1))
@musm
musm Jan 3, 2017 edited Contributor

Probably just as efficient to use signbit, maybe I should use it instead
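
(For illustration, a minimal sketch comparing the two with made-up values; not the PR's final code:)

x = -2.5; T = typeof(x)
xu = reinterpret(Unsigned, x)
xsb_shift = Int(xu >> Unsigned(sizeof(T)*8 - 1))  # shift the raw bits, as in the diff above
xsb_clear = Int(signbit(x))                       # same value via signbit, arguably clearer
xsb_shift == xsb_clear                            # true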

@stevengj
stevengj Jan 3, 2017 Member

ya, would be clearer

+using Base: sign_mask, exponent_mask, exponent_one, exponent_bias,
@musm
musm Jan 3, 2017 Contributor

No need to import any of these since we don't extend them

@musm
Contributor
musm commented Jan 3, 2017

The i686 error is the usual CI failure and is unrelated

base/math.jl
@@ -70,11 +71,23 @@ end
# evaluate p[1] + x * (p[2] + x * (....)), i.e. a polynomial via Horner's rule
macro horner(x, p...)
+ @gensym val
@stevengj
stevengj Jan 3, 2017 Member

Why is this necessary? Won't macro hygiene automatically convert t into a unique name?

@musm
musm Jan 3, 2017 Contributor

Yes, good point. I think this was an issue in an earlier version of Julia, but in 0.6 hygiene converts it into a unique name.
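
As an illustration of the hygiene point, here is a simplified Horner macro (a sketch, not the exact code in base/math.jl); the temporary t introduced in the returned expression is renamed to a unique symbol during expansion, so no explicit gensym is needed:

macro horner_demo(x, p...)
    ex = esc(p[end])
    for i in length(p)-1:-1:1
        ex = :(muladd(t, $ex, $(esc(p[i]))))
    end
    Expr(:block, :(t = $(esc(x))), ex)  # hygiene turns t into a unique name
end

@horner_demo(0.5, 1.0, 2.0, 3.0)  # 1.0 + 0.5*(2.0 + 0.5*3.0) == 2.75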

base/special/exp.jl
+MINEXP(::Type{Float32}) = -103.97207708f0 # log 2^-150
+
+# coefficients from: lib/msun/src/e_exp.c
+@inline exp_kernel{T<:Float64}(x::T) = @horner_oftype(x, 1.66666666666666019037e-1,
@JeffBezanson
JeffBezanson Jan 3, 2017 Member

x::Float64 ?

base/math.jl
+
+# evaluate p[1] + x * (p[2] + x * (....)), i.e. a polynomial via Horner's rule
+# and convert coefficients to same type as x
+macro horner_oftype(x, p...)
@stevengj
stevengj Jan 3, 2017 Member

Why is this macro needed? If you change the floating-point type, you generally will change the degree of the polynomial as well, and you can just use literals of the correct type, no? e.g. it looks like you only use it for Float64 and Float32 below, and Julia has a compact syntax for Float32 literals (append f0).

@musm
musm Jan 3, 2017 Contributor

I agree that it will generally change the degree of the polynomial. That said, one case where it doesn't really change is if you also want native Float16 support (although on CPUs this is useless): dispatching Float16 to the same polynomial as Float32 works well in practice. From many cases on different functions, I found it's really hard to find good coefficients for Float16 by reducing the degree of the Float32 version. For example, this function will work just as well within accuracy for Float16. Thus it's helpful to define this macro to take care of such dispatch cases.
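
A minimal sketch of the idea (illustrative only; the actual @horner_oftype in this PR may differ): wrap each coefficient in oftype so a single coefficient list can serve more than one float type:

macro horner_oftype(x, p...)
    ex = :(oftype(t, $(esc(p[end]))))
    for i in length(p)-1:-1:1
        ex = :(muladd(t, $ex, oftype(t, $(esc(p[i])))))
    end
    Expr(:block, :(t = $(esc(x))), ex)
end

@horner_oftype(0.5f0, 1.0, 0.5, 0.25)  # returns a Float32; coefficients are converted to match x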

@stevengj
stevengj Jan 3, 2017 edited Member

I would rather not add the extra macro for an odd case (it is unusual that you wouldn't be able to reduce the degree for Float16) that we don't really implement (probably on CPUs we will just compute exp(x::Float16) by converting to/from Float32, no?).

@musm
musm Jan 3, 2017 Contributor

I would rather not add the extra macro for an odd case (it is unusual that you wouldn't be able to reduce the degree for Float16) that we don't really implement

Not really unusual for Float16 (as opposed to going from Float128 -> Float64). You have a lot less precision, so it's very hard to find an even smaller degree polynomial that retains the same accuracy constraint (< 1 ulp). Testing accuracy and tuning the polynomial coefficients is the hardest and most time-consuming part.

  (probably on CPUs we will just compute exp(x::Float16) by converting to/from Float32, no?)

yes

I will remove it anyway

base/special/exp.jl
+MINEXP(::Type{Float32}) = -103.97207708f0 # log 2^-150
+
+# coefficients from: lib/msun/src/e_exp.c
+@inline exp_kernel{T<:Float64}(x::T) = @horner_oftype(x, 1.66666666666666019037e-1,
@stevengj
stevengj Jan 3, 2017 Member

Why not just exp_kernel(x::Float64)?

@musm
musm Jan 3, 2017 edited Contributor

Originally my motivation was to have two or more polynomials depending on the float type and use dispatch as follows:
- A low-degree polynomial for Float16 and Float32
- A higher-degree polynomial for Float64 and other floats like Float128

so in: https://github.com/JuliaMath/Amal.jl/blob/master/src/exp.jl

I have the two polynomial kernels dispatch the following types

SmallFloat = Union{Float16,Float32}
LargeFloat = Float64   # perhaps Union{Float64,Float128} in the future

That grand goal might be a pipe dream, so if you all prefer I can restrict the types instead of using this and the @horner_oftype machinery (although there is some elegance to the current approach of @horner_oftype plus polynomial dispatch on the type).

@stevengj
stevengj Jan 3, 2017 edited Member

I would prefer eliminating the extra macro for now.

(Wouldn't the Float16 computations be faster in Float32 arithmetic anyway? And wouldn't we need a higher degree for Float128 than for Float64?)

@musm
musm Jan 3, 2017 Contributor

Yep, it would be more efficient on the CPU. The motivation was for GPU code, where you do have native Float16 hardware and there is a real benefit (there is some very nice work coming along on better Float16 support). For Float128 I have not tested yet, but I suspect a higher degree would be needed.

I guess I can revisit this when the time comes; for now I'll restrict the polynomial dispatch and remove the additional macro.

base/special/exp.jl
+ # Float64, which is well within the allowable error. however,
+ # 2.718281828459045 is closer to the true value so we prefer that answer,
+ # given that 1.0 is such an important argument value.
+ if x == T(1.0) && T == Float64
@stevengj
stevengj Jan 3, 2017 Member

Maybe x == T(1.0) && return T(2.718281828459045235360) so that we get the exactly rounded result for Float32 too?

@musm
musm Jan 3, 2017 Contributor

Float32 is exactly rounded without this, so it's better to do it this way and avoid the extra check in the Float32 version.
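
For context, a sketch (assuming a BigFloat reference) of the kind of check behind this:

Float64(exp(big(1.0))) == 2.718281828459045           # true: the correctly rounded value used in the branch above
Float32(exp(big(1.0))) == Float32(2.718281828459045)  # true: the Float32 path already rounds to this on its own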

@musm
musm Jan 4, 2017 Contributor

What's the difference between T == Float64 and T === Float64? I'm leaning towards using the latter, but for no particular reason.

base/special/exp.jl
+ end
+ else
+ n = round(T(LOG2E)*x)
+ k = unsafe_trunc(Int,n)
@stevengj
stevengj Jan 3, 2017 Member

It seems bad that this will behave differently on 32-bit and 64-bit machines?

@musm
musm Jan 3, 2017 Contributor

No, it doesn't matter in this case. k is the integer representing the exponent of the float, so there is no way it will cause issues: the float overflows long before k exceeds the range of an Int32. We could just as well use Int32, but I thought it might be nicer to use the machine-size Int?

@stevengj
stevengj Jan 3, 2017 edited Member

Better to use Int32, I think. (We could even use Int16, it sounds like, but there's probably no advantage to that.)

@musm
musm Jan 3, 2017 Contributor

I made the change, but I don't fully understand why we should prefer Int32 over Int. I tried both and it's not like there is any critical difference in the resulting code.
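
For reference, a rough sketch of the reduction step being discussed (illustrative values, not the PR's exact code): x is written as k*log(2) + r so that exp(x) = 2^k * exp(r), and once the overflow/underflow inputs have been filtered out |k| stays near log2 of the largest finite Float64 (~1024), far inside the Int32 range:

LOG2E = 1.442695040888963407359924681001892137  # 1/log(2) as a Float64 literal
x = 100.0
n = round(LOG2E*x)
k = unsafe_trunc(Int32, n)                 # |k| <= ~1100 after the range checks
r = muladd(n, -0.6931471805599453, x)      # r = x - n*log(2), ignoring the hi/lo split
exp(r)*2.0^k ≈ exp(x)                      # true up to rounding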

@stevengj
Member
stevengj commented Jan 3, 2017

Would be good to have test cases that exercise each of the branches, including test cases just above and just below the various thresholds like T(0.5)*T(LN2) (since that's usually where the error is maximized).
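
Something along these lines, perhaps (a sketch only; the thresholds and tolerance would need to match the final implementation):

using Base.Test
for T in (Float32, Float64)
    thresh = T(0.5)*T(log(2))
    for x in (prevfloat(thresh), thresh, nextfloat(thresh), -thresh)
        @test exp(x) ≈ T(exp(big(x)))
    end
end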

@stevengj stevengj added the maths label Jan 3, 2017
@stevengj
Member
stevengj commented Jan 3, 2017

Out of curiosity, why is it so much faster than the openlibm version? More inlining?

@musm
Contributor
musm commented Jan 3, 2017

Would be good to have test cases that exercise each of the branches, including test cases just above and just below the various thresholds like T(0.5)*T(LN2) (since that's usually where the error is maximized).

I have carefully tested this function here: https://github.com/JuliaMath/Amal.jl/tree/master/test
over the ranges -10:0.0002:10, -1000:0.001:1000, -120:0.0023:1000, and -1000:0.02:2000 (and denser cases locally; see https://github.com/JuliaMath/Amal.jl/blob/master/test/accuracy.jl), and have compared it with the Base version.

@musm
Contributor
musm commented Jan 3, 2017 edited

Out of curiosity, why is it so much faster than the openlibm version? More inlining?

Hard for me to tell without seeing something like code_llvm for the openlibm version. It might be a combination of more inlining and the handling of some of the constants (openlibm stores some of them in an array). Is there any (even if minor) overhead to calling C code? These functions run in 6-15 ns, so even a 1 ns overhead could have an impact.

@stevengj
Member
stevengj commented Jan 3, 2017 edited

@musm, I don't doubt that you have tested your code, but we should have the key tests in our test suite, at the very least to make sure that we don't get regressions in the future.

@musm
Contributor
musm commented Jan 3, 2017

Addressed some of the review comments.

I added a function to float.jl that returns the corresponding unsigned integer type for a float type.

fpintsize(::Type{Float64}) = UInt64
fpintsize(::Type{Float32}) = UInt32

This makes this function and one of the other float functions cleaner:

reinterpret(T, rem(Int32(32), fpintsize(T)))
# vs. what we use now
xu = reinterpret(Unsigned, x)
reinterpret(T, rem(Int32(32), typeof(xu)))

Need to still add tests and benchmarks

base/float.jl
+# integer size of float
+fpintsize(::Type{Float64}) = UInt64
+fpintsize(::Type{Float32}) = UInt32
+fpintsize(::Type{Float16}) = UInt16
@stevengj
stevengj Jan 3, 2017 Member

fpintsize sounds like it should return a number (64 or 8)... maybe fpint or fpinttype?

@musm
musm Jan 3, 2017 Contributor

fpinttype sounds good to me

base/float.jl
+@pure exponent_raw_max{T<:AbstractFloat}(::Type{T}) = Int(exponent_mask(T) >> significand_bits(T))
+
+
+
@musm
musm Jan 3, 2017 Contributor

need to get rid of all this additional whitespace :)

base/special/exp.jl
+ -1.65339022054652515390e-6, 4.13813679705723846039e-8)
+
+# coefficients from: lib/msun/src/e_expf.c
+@inline exp_kernel(x::Float32) = @horner(x, 1.6666625440f-1, -2.7667332906f-3)
@stevengj
stevengj Jan 3, 2017 Member

(This is such a low degree that I wonder if we'd be better off using a minimax polynomial rather than a minimax rational function, at least for the single-precision case — the optimal polynomial would require a higher degree, but it would avoid the division. I don't think we need to address that possibility in this PR, however, since this is no worse than what we are doing now.)

@musm
musm Jan 3, 2017 edited Contributor

This is a good question. I have already played with many variations, and my conclusion is that the msun version is the best of everything I have tested, considering both accuracy (the critical goal of being < 1 ulp) and performance.

Here's the problem with the minimax polynomial approach for the exp function:

On my test suite for Float32, over the range vcat(-10:0.0002:10, -1000:0.001:1000, -120:0.0023:1000, -1000:0.02:2000):

FMA system, rational:        max 0.86076152 at x = -5.85780, mean 0.03712445
non-FMA system, rational:    max 0.86076152 at x = -5.85780, mean 0.03712434
FMA system, polynomial:      max 0.84459400 at x = -48.8587, mean 0.03627487
non-FMA system, polynomial:  max 1.09986448 at x = 5.23060,  mean 0.03798720

(errors may be larger, since we haven't tested all floats)

As you can see, it's not possible to stay under 1 ulp on non-FMA systems using this polynomial (below).

What about speed? I tested 3 hot branches, and the polynomial version on an FMA system is ~0.5-0.8 ns slower.

For reference, this is the best minimax polynomial I could find for Float32:

@inline exp_kernel{T<:SmallFloat}(x::T) = @horner_oftype(x, 1.0, 1.0, 0.5,
    0.1666666567325592041015625,
    4.1666455566883087158203125e-2,
    8.333526551723480224609375e-3,
    1.39357591979205608367919921875e-3,
    1.97799992747604846954345703125e-4)

Note that it's not as simple as adding more terms to this polynomial to improve the accuracy; this is the lowest-degree minimax polynomial I was able to find that meets the < 1 ulp constraint.

For Float64:

@inline exp_kernel{T<:LargeFloat}(x::T) = @horner_oftype(x, 1.0, 1.0, 0.5,
    0.16666666666666685170383743752609007060527801513672,
    4.1666666666666692109277647659837384708225727081299e-2,
    8.3333333333159547579027659480743750464171171188354e-3,
    1.38888888888693412537733706813014578074216842651367e-3,
    1.9841269898657093212653024227876130680670030415058e-4,
    2.4801587357008890921336585755341275216778740286827e-5,
    2.7557232875898009206386968239499424271343741565943e-6,
    2.7557245320026768203034231441428403286408865824342e-7,
    2.51126540120060271373185023340013355408473216812126e-8,
    2.0923712382298872819985862227861600493028504388349e-9)

I haven't tested this in a while, but the same conclusion holds.
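
For completeness, a sketch of how ulp figures like those above can be measured against a BigFloat reference (myexp stands for whichever candidate implementation is being tested; this is not the Amal.jl test code itself):

function ulp_error{T<:AbstractFloat}(myexp, x::T)
    ref = exp(big(x))
    return Float64(abs(big(myexp(x)) - ref)/eps(T(ref)))
end

maximum(x -> ulp_error(exp, Float32(x)), -10:0.0002:10)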

base/special/exp.jl
+@inline exp_kernel(x::Float32) = @horner(x, 1.6666625440f-1, -2.7667332906f-3)
+
+# custom coefficients (slightly better mean accuracy, but also a bit slower)
+# @inline exp_kernel(x::Float32) = @horner_oftype(x, 0.1666666567325592041015625f0,
@musm
musm Jan 3, 2017 edited Contributor

I need to change this to @horner or delete the comment entirely

base/special/exp.jl
@@ -0,0 +1,137 @@
+# Based on FreeBSD lib/msun/src/e_exp.c
+# ====================================================
+# Copyright (C) 2004 by Sun Microsystems, Inc. All rights reserved. Permission
@tkelman
tkelman Jan 4, 2017 Member

need to add exceptions to the license file and contrib/add_license_to_files.jl

@musm
musm Jan 4, 2017 Contributor

I don't really know what you mean. What do I need to do?

I looked under here for example:
https://github.com/JuliaLang/openlibm/blob/master/LICENSE.md

This part is kinda relevant

OpenLibm

OpenLibm contains code that is covered by various licenses.

The OpenLibm code derives from the FreeBSD msun and OpenBSD libm implementations, which in turn derives from FDLIBM 5.3. As a result, it has a number of fixes and updates that have accumulated over the years in msun, and also optimized assembly versions of many functions. These improvements are provided under the BSD and ISC licenses. The msun library also includes work placed under the public domain, which is noted in the individual files. Further work on making a standalone OpenLibm library from msun, as part of the Julia project is covered under the MIT license. The test files, test-double.c and test-float.c are under the LGPL.
@musm
musm Jan 4, 2017 Contributor

Do I just need to add a link to LICENSE.md to openlibm (it's already there) and say that was used as a reference or something?

@tkelman
tkelman Jan 4, 2017 Member

The Julia license file needs to have an exception here, and the script that adds license headers needs an exception for this file. It's derived code, so it should usually retain its original licensing.

@musm
musm Jan 4, 2017 Contributor

Is there an example I can copy to do this?

@tkelman
tkelman Jan 4, 2017 Member

though I would actually put it in the "The following components of Julia's standard library have separate licenses:" section since it's translated Julia code rather than directly copied C code

@musm
musm Jan 4, 2017 Contributor

So I should add this line:

- base/special/exp.jl (see [FREEBSD MSUN](https://github.com/freebsd/freebsd) [FreeBSD/2-clause BSD/Simplified BSD License])

below "The following components of Julia's standard library have separate licenses:", even if it's derived code?

@musm
musm Jan 4, 2017 Contributor
# ====================================================
# Copyright (C) 2004 by Sun Microsystems, Inc. All rights reserved. Permission
# to use, copy, modify, and distribute this software is freely granted,
# provided that this notice is preserved.
# ====================================================

do I even need to include this? We are distributing the original file.

Is it necessary to add a line here https://github.com/JuliaLang/julia/blob/master/contrib/add_license_to_files.jl#L39 to skip this?
I'm not sure; parts of this file are ported and derived code.

@tkelman
tkelman Jan 4, 2017 Member

The notice says copyright Sun Microsystems, with permission to use, copy, modify, and distribute granted provided the notice is preserved. So it's not "MIT license, copyright JuliaLang contributors" like the majority of the rest of Base, and we're not distributing the original file in this repo under our MIT license.

@musm
musm Jan 4, 2017 Contributor

Ok. I added the commit to address this : 46058d3

@tkelman
tkelman Jan 4, 2017 Member

You only did half of it.

Is it necessary to add a line here https://github.com/JuliaLang/julia/blob/master/contrib/add_license_to_files.jl#L39 to skip this?

Yes. "exceptions to the license file and contrib/add_license_to_files.jl"

@KristofferC
Contributor

It is very cool that it is possible to do this in Julia.

@musm
Contributor
musm commented Jan 4, 2017 edited

Running nanosoldier after JuliaCI/BaseBenchmarks.jl#58 is merged should be an interesting test, since quite a few other files in BaseBenchmarks use the exp function.

@musm
Contributor
musm commented Jan 5, 2017 edited

OK, so the benchmarks have been merged.

@musm
Contributor
musm commented Jan 7, 2017

Should we trigger nanosoldier now?

@vchuravy
Member
vchuravy commented Jan 7, 2017 edited

@nanosoldier runbenchmarks(ALL, vs=":master")

Also this should be squashed before committing it to master.

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@stevengj
Member
stevengj commented Jan 7, 2017

Performance looks good; the regressions are probably noise. But I don't think nanosoldier was running the new exp benchmarks?

@musm
Contributor
musm commented Jan 8, 2017

Can we retrigger the benchmarks? They were just uploaded to nanosoldier. Thanks

@Sacha0
Contributor
Sacha0 commented Jan 8, 2017

@nanosoldier runbenchmarks(ALL, vs=":master")

base/special/exp.jl
+ # compute approximation
+ z = r*r
+ p = r - z*exp_kernel(z)
+ y = T(1.0) - ((lo - (r*p)/(T(2.0) - p)) - hi) # compute polynomial on reduced arugment
@tkelman
tkelman Jan 8, 2017 Member

argument

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

musm added some commits Jan 3, 2017
@musm musm Add pure julia exp function 84bf356
@musm musm Add a few more ldexp tests fe57cef
@musm
Contributor
musm commented Jan 8, 2017 edited

Squashed and good to go on my end. The regressions don't make sense to me since those files don't even use exp. I'm also not sure why we don't see more improvements, since I have benchmarked other branches locally and seen gains in exp, but I think this is just a resolution/noise issue on nanosoldier.

@musm
Contributor
musm commented Jan 9, 2017

bump

@stevengj stevengj merged commit f864b22 into JuliaLang:master Jan 9, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@musm musm deleted the musm:expfun branch Jan 9, 2017