
improved implementation of hypot(a,b) #31922

Merged: 12 commits, Jun 13, 2019

Conversation

cfborges
Contributor

@cfborges cfborges commented May 3, 2019

Provides a fast and accurate implementation of hypot() that leverages
the fused multiply add where available. The approach is explained
and tested in detail in the paper:
An Improved Algorithm for hypot(a,b) by Carlos F. Borges
The article is available on arXiv:
https://arxiv.org/abs/1904.09481

cc: @simonbyrne

@simonbyrne simonbyrne requested a review from stevengj May 3, 2019 21:31
@simonbyrne
Contributor

simonbyrne commented May 3, 2019

Welcome, and thanks for the contribution, and the paper. Overall I think this looks pretty good.

Have you done any benchmarks? I was wondering if instead of frexp/ldexp, would it be faster to say doing something like

    if ax > sqrt(floatmax(T)/2)
        sx = ax * floatmin(T)
        sy = ay * floatmin(T)
        return sqrt(muladd(sx,sx,sy*sy)) / floatmin(T)
    elseif ay < sqrt(floatmin(T))
        sx = ax / floatmin(T)
        sy = ay / floatmin(T)
        return sqrt(muladd(sx,sx,sy*sy)) * floatmin(T)
    else
        return sqrt(muladd(ax,ax,ay*ay))
    end

(here I am assuming that the compiler will convert the divisions into multiplications, which it should since floatmin(T) is a power of 2)?

@giordano
Contributor

giordano commented May 3, 2019

Any hope to extend this and fix the vararg method? See #27141

@simonbyrne
Contributor

Looks like it isn't getting hypot(Inf,NaN) == hypot(NaN,Inf) == Inf correct.

@cfborges
Contributor Author

cfborges commented May 4, 2019

Is it not correct to return NaN if either argument is NaN since we have no idea what caused the NaN in the first place? Otherwise hypot(0/0,0/0) would give Inf which is not correct.

I haven't timed the rescaling using division/multiplication vs. frexp/ldexp but I'll look at that when I get back to the office on Monday. Mostly I didn't worry about that since it is an extremely unusual case. The code clears widely varying operands first which would clear almost anything in actual practice that involved an 'extreme' exponent (i.e. possible overflow/underflow). You would only need to rescale in the very odd event that you had both operands with nearby extreme exponents. All that said, I like the simple elegance of the rescaling you suggest.
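The "clears widely varying operands first" step can be sketched as follows (a hypothetical simplification for illustration, not the PR's actual code; the rescaling branches are omitted here):

```julia
# Hypothetical sketch of the wide-operand early exit: if the smaller operand
# is below about half an ulp of the larger one, it cannot affect the rounded
# result, so the larger operand is returned directly and no rescaling is
# ever needed in that case.
function hypot_sketch(x::T, y::T) where T<:AbstractFloat
    ax, ay = abs(x), abs(y)
    if ax < ay
        ax, ay = ay, ax              # ensure ax >= ay
    end
    if ay <= ax*sqrt(eps(T)/2)       # ay^2 cannot change sqrt(ax^2 + ay^2)
        return ax
    end
    return sqrt(muladd(ax, ax, ay*ay))   # no overflow/underflow guard here
end
```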

@KlausC
Contributor

KlausC commented May 5, 2019

I think that because hypot(a, ±Inf) == Inf for all real a, it is reasonable to require hypot(NaN, Inf) == Inf and hypot(Inf, NaN) == Inf. The example `hypot(0/0, 0/0)` is not in this category, but `hypot(NaN, NaN)` is.

@jebej
Contributor

jebej commented May 5, 2019

You could make a similar argument that hypot(NaN, a) == NaN for any a. It makes sense to me that here the NaN should poison the result. We could also argue that hypot(NaN, Inf) == √(NaN^2 + Inf^2) == NaN.

FWIW, MATLAB gives NaN, and Python gives Inf for this operation.

@cfborges
Contributor Author

cfborges commented May 5, 2019

Welcome, and thanks for the contribution, and the paper. Overall I think this looks pretty good.

Have you done any benchmarks? I was wondering if instead of frexp/ldexp, would it be faster to say doing something like

    if ax > sqrt(floatmax(T)/2)
        sx = ax * floatmin(T)
        sy = ay * floatmin(T)
        return sqrt(muladd(sx,sx,sy*sy)) / floatmin(T)
    elseif ay < sqrt(floatmin(T))
        sx = ax / floatmin(T)
        sy = ay / floatmin(T)
        return sqrt(muladd(sx,sx,sy*sy)) * floatmin(T)
    else
        return sqrt(muladd(ax,ax,ay*ay))
    end

(here I am assuming that the compiler will convert the divisions into multiplications, which it should since floatmin(T) is a power of 2)?

One problem with this approach is that it can cause the other argument to go out of range. For example

    x = 1e154
    x > sqrt(floatmax(Float64)/2)   # true: x would be rescaled downward
    y = 1e147
    y <= x*sqrt(eps(Float64)/2)     # false: y is not negligible relative to x
    y = y*floatmin(Float64)         # rescale y along with x
    y < sqrt(floatmin(Float64))     # true: the rescaled y underflows on squaring

Here x is too large and would overflow on squaring. Since y does not have a widely varying exponent relative to x, we rescale both. Unfortunately, the rescaled y would then underflow on squaring.

One could use the square root of floatmin, but that is a problem if the exponent of that value is not even (it is even in Float64, but if this is to be fully general then it needs to work everywhere).

@giordano
Contributor

giordano commented May 5, 2019

I think I kind of agree with @jebej:

julia> hypot(Inf, -1/0)
Inf

doesn't necessarily look like a reasonable answer.

@cfborges
Contributor Author

cfborges commented May 5, 2019

I would add that if the motivation behind hypot(x,y) is that it is just a more stable way of computing sqrt(x*x+y*y), then any NaN should poison the computation, just as it would in sqrt(x*x+y*y).

@simonbyrne
Contributor

The behaviour is specified in the IEEE754 standard (§9.2.1):

For the hypot function, hypot(±0, ±0) is +0, hypot(±∞, qNaN) is +∞, and hypot(qNaN, ±∞) is +∞.

My guess is that the logic is that NaN represents an unknown value, and since hypot(Inf, x) == Inf for all possible other values of x, it should be Inf for x = NaN as well. This is true for other functions, e.g.:

julia> NaN^0.0
1.0
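A minimal sketch of those IEEE 754 §9.2.1 special cases (the helper name and structure are hypothetical; Base's actual code folds this into the main function):

```julia
# Hypothetical helper encoding IEEE 754 §9.2.1: an infinite argument
# dominates even a NaN; otherwise a NaN poisons the result.
function hypot_special(x::Float64, y::Float64)
    if isinf(x) || isinf(y)
        return Inf
    elseif isnan(x) || isnan(y)
        return NaN
    end
    return nothing   # signal: fall through to the general algorithm
end
```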

@giordano
Contributor

giordano commented May 5, 2019

My guess is that the logic is that NaN represents an unknown value, and since hypot(Inf, x) == Inf for all possible other values of x, then it should be for x=NaN as well.

This reasoning should hold for addition as well, shouldn't it?

julia> Inf + NaN
NaN

@StefanKarpinski
Sponsor Member

Inf + -Inf does not equal Inf.

@giordano
Contributor

giordano commented May 5, 2019

I think I get the idea now: Inf plus any non-negative real value (as is the case in hypot) will always be Inf, whatever the other value is, including a ("non-negative") NaN.

@simonbyrne
Contributor

One could use the square root of floatmin but that will be a problem if the exponent for that value is not even (it's even in Float64 but if this is to be fully general then it needs to work everywhere).

Good point. We could define a specific constant for this, or use something like eps(T)/floatmin(T)?

@cfborges
Contributor Author

cfborges commented May 6, 2019

Can't fight the standard, so the Inf issue is now compliant.

Decided to go with eps(sqrt(floatmin(T))) as the rescaling constant. It is always a power of the base, and as long as the exponent range is roughly symmetric and sufficiently larger than the precision (true of all the standard floating-point types), there won't be any problems.

I'll post the updates in a little bit.
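The choice can be sanity-checked directly for Float64 (a REPL sketch): floatmin is 2^-1022, its square root is exactly 2^-511, and eps of that value is the power of two 2^-563, so multiplying or dividing by it is exact.

```julia
# The rescaling constant for Float64:
scale = eps(sqrt(floatmin(Float64)))
significand(scale) == 1.0    # a pure power of 2, so scaling is error-free
scale == 2.0^-563            # sqrt(2^-1022) = 2^-511; eps(2^-511) = 2^-563
```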

Provides a fast and accurate implementation of hypot() that leverages
the fused multiply add where available. The approach is explained
and tested in detail in the paper:
 An Improved Algorithm for hypot(a,b) by Carlos F. Borges
The article is available on arXiv:
https://arxiv.org/abs/1904.09481
@simonbyrne
Contributor

Overall I think this looks good.

hypot is somewhat poorly tested: can you add some tests for large/small values? (The math-related tests are currently in test/math.jl.) E.g.

@test hypot(floatmax(T),1.0) == floatmax(T)

and something to check that we're not losing precision due to gradual underflow?
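A sketch of what such edge-case tests might look like (illustrative values chosen here, not necessarily the tests that were eventually added):

```julia
using Test
for T in (Float32, Float64)
    @test hypot(floatmax(T), one(T)) == floatmax(T)      # no spurious overflow
    @test hypot(floatmin(T), zero(T)) == floatmin(T)     # no spurious underflow
    # exact 3-4-5 triangle at the bottom of the normal range, where gradual
    # underflow of the squares would otherwise destroy the result:
    @test hypot(T(3)*floatmin(T), T(4)*floatmin(T)) == T(5)*floatmin(T)
end
```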

@StefanKarpinski StefanKarpinski changed the title Replacement for hypot(a,b) improved implementation of hypot(a,b) May 22, 2019
@cfborges
Contributor Author

cfborges commented May 23, 2019

I have noticed one compiler-related issue that I'd love to get some input on. Specifically, although there are three nearly identical calls to muladd in the code, the compiler on my machine only turns the last one into a fused multiply-add (it may have something to do with constant propagation, but I really don't know). The end result is that hypot(K*x,K*y), for a K that is some large power of 2 (or very negative power of 2), might not be the same as K*hypot(x,y). This problem disappears if I replace muladd with fma. I would love some opinions on whether to just dump muladd and use fma instead. Since the fused multiply-add is so common on today's architectures, that seems like the way to go, even though someone on a platform without it would see a slowdown.

@simonbyrne
Contributor

simonbyrne commented May 23, 2019

That's interesting. @vchuravy any idea why this might happen?

One alternative that might help is to force them all through the same path, e.g.

    scale = one(T)
    if ax > sqrt(floatmax(T)/2)
        scale = 1/eps(sqrt(floatmin(T)))  #Rescaling constant
        ax = ax/scale
        ay = ay/scale
    elseif ay < sqrt(floatmin(T))
        scale = eps(sqrt(floatmin(T)))  #Rescaling constant
        ax = ax/scale
        ay = ay/scale
    end
    return scale*sqrt(muladd(ax,ax,ay*ay))

and add a test for the behaviour?
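A scaling-invariance test along those lines might look like this (a sketch: hypot(3,4) is exactly 5, so scaling both arguments by any power of two, including ones large or small enough to force the rescaling branches, must scale the result exactly):

```julia
using Test
for K in (2.0^600, 2.0^-600, 1.0)
    # A power-of-two scale factor must pass through hypot exactly.
    @test hypot(3.0*K, 4.0*K) == 5.0*K
end
```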

@cfborges
Contributor Author

cfborges commented May 23, 2019

I did that and it does work; I just hate adding a multiply by one. It rubs me the wrong way.

But if you think that's better than going to an fma call then I'll make the change.

cfborges and others added 5 commits May 29, 2019 16:10
Co-Authored-By: Simon Byrne <simonbyrne@gmail.com>
Co-Authored-By: Simon Byrne <simonbyrne@gmail.com>
Co-Authored-By: Simon Byrne <simonbyrne@gmail.com>
… the accuracy. It is now far better than the clib hypot code even without a fused multiply-add. If you look at the arXiv paper there is a plot that shows the stunning performance difference. This comes at the cost of two more multiplies and one more divide (and some adds). Still way cheaper than the clib code.
…e. It is 10 times better than the C math library hypot function in this respect.
@simonbyrne
Contributor

Very nice! Should be good to merge once tests finish, unless you have more changes planned.

@cfborges
Contributor Author

Very nice! Should be good to merge once tests finish, unless you have more changes planned.

No more changes. I believe I've taken it as far as my skills allow. I kept hoping to find a simple form that was always correctly rounded (I didn't want to use any Veltkamp-Dekker-type tricks), but this is as close as I could get. You can squeeze a tiny bit more out by adding more branches, but that would slow it down for almost no gain, so I don't see the point.

@giordano
Contributor

PR #30301 adds a few tests for edge cases; it would probably be useful to make sure those pass in this PR as well.

@simonbyrne simonbyrne merged commit 4a04600 into JuliaLang:master Jun 13, 2019
@simonbyrne
Contributor

Thanks!

@cfborges
Contributor Author

Awesome. It's been fun. By the way, I do have a version that always returns the correctly rounded answer but it REQUIRES the fused multiply-add and gives poor results without it (results are similar to the Naive (Unfused) code in the paper if there is no fma). If that is of interest let me know how I might contribute it.

@cfborges cfborges deleted the hypot branch June 13, 2019 16:47
@simonbyrne
Contributor

I'm not sure: the simplest option is to create a simple package. I'm not sure if there are existing packages where it could fit (https://github.com/JuliaIntervals/CRlibm.jl maybe?). @dpsanders might have some suggestions.

What's the performance like vs what you currently have? We don't currently have a great way to feature gate (see #9855). There is currently a line in the code:

const FMA_NATIVE = muladd(nextfloat(1.0),nextfloat(1.0),-nextfloat(1.0,2)) == -4.930380657631324e-32

but on my machine (which does have FMA) I get:

julia> Base.Math.FMA_NATIVE
false

so that doesn't really help.

@StefanKarpinski
Sponsor Member

How does it do in terms of accuracy if the multiply add isn't fused? We have both fma (do the operation fused, even if that's slow) and muladd which uses an FMA instruction if present or just a normal multiply and add if there isn't an FMA instruction. We could always take the performance hit on older platforms that don't have an FMA instruction.

@StefanKarpinski
Sponsor Member

julia> fma(nextfloat(1.0),nextfloat(1.0),-nextfloat(1.0,2))
4.930380657631324e-32

Wrong sign?

@simonbyrne
Contributor

ah ha!
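Presumably the fix is just flipping the sign of the expected residual (a sketch of the corrected check, not necessarily the committed code): with a fused multiply-add the expression evaluates to exactly +2^-104, while the unfused path rounds the product to nextfloat(1.0, 2) and yields 0.

```julia
# nextfloat(1.0)^2 == 1 + 2^-51 + 2^-104 exactly; subtracting
# nextfloat(1.0, 2) == 1 + 2^-51 leaves +2^-104 only when the multiply-add
# is fused. Unfused, the product rounds to 1 + 2^-51 and the result is 0.
const FMA_NATIVE = muladd(nextfloat(1.0), nextfloat(1.0), -nextfloat(1.0, 2)) == 2.0^-104
```

Note that 2.0^-104 == 4.930380657631324e-32, the positive value Stefan observed above.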

simonbyrne added a commit that referenced this pull request Jun 13, 2019
simonbyrne added a commit that referenced this pull request Jun 13, 2019
@cfborges
Contributor Author

How does it do in terms of accuracy if the multiply add isn't fused? We have both fma (do the operation fused, even if that's slow) and muladd which uses an FMA instruction if present or just a normal multiply and add if there isn't an FMA instruction. We could always take the performance hit on older platforms that don't have an FMA instruction.

Without the fma it performs exactly like the Naive (Unfused) code from my paper: one-ulp errors about 17% of the time on normally distributed inputs (compared to 13% for the clib hypot). It requires 4 fma calls.

@dpsanders
Contributor

CRlibm.jl is just a wrapper around the CRlibm library and I would prefer to leave it as such.

But it would be great if we could start writing a CorrectRounding.jl library to replace it!

@stevengj
Member

stevengj commented Mar 20, 2024

@cfborges, I notice that you published this code in ACM TOMS.

This is a bit of a concern, since if you assigned the copyright to ACM (the default, I think?), then the code (which you no longer own) is by default licensed under the ACM TOMS license, which is not free/open-source — it is only free for "noncommercial use".

However, there is the possibility for the author of the work to request that ACM TOMS release the code under a more liberal license, or possibly, at the time of publication, the author could have requested to retain copyright ownership of the work. Did you do this?

(The situation is a bit murky to me if you contributed this code to Julia before assigning copyright to ACM, but it seems better to clarify the situation by ensuring that the ACM TOMS code for your article is released under a free/open-source license like MIT or BSD.)

@cfborges
Contributor Author

cfborges commented Mar 20, 2024 via email

@stevengj
Member

stevengj commented Mar 20, 2024

Ah, great, that is covered by the ACM copyright release:
[screenshot: ACM copyright-release form with a public-domain option checkbox]
Assuming you checked that box, it should be all good? Looks like it, since there is a public-domain notice in your article text:
[screenshot: public-domain notice in the article text]
though it is not obvious when downloading the code.

It's weird that the ACM TOMS page for Algorithm 1014 does not make this clear when you download the Supplemental Material (which doesn't list any copyright information that I can see). You might want to email TOMS to request that they add a public-domain notice on your Supplemental Material to clarify the copyright status of your code. Otherwise a reader might by default assume it is restricted by the ACM TOMS semi-free license.

@cfborges
Contributor Author

cfborges commented Mar 20, 2024 via email

@stevengj
Member

stevengj commented Mar 20, 2024

I edited my post to note that a public-domain notice indeed appears in your article text. But it would be nice for readers if ACM TOMS also showed such a notice on the Supplementary Material too.
