
Correctly rounded variant of the hypot code #32345

Merged · 2 commits · Jul 1, 2019
Conversation

cfborges (Contributor)

This change implements the correctly rounded variant of the hypot code only in the case where there is a native FMA.

if Base.Math.FMA_NATIVE
    hsquared = h*h
    axsquared = ax*ax
    h -= (fma(-ay,ay,hsquared-axsquared) + fma(h,h,-hsquared) - fma(ax,ax,-axsquared))/(2*h)
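The branch above can be sketched in Python, emulating a correctly rounded fma with exact rational arithmetic since a hardware fma may not be exposed. The helper names `fma_emulated` and `hypot_corrected` are mine, and the inf/nan checks and prescaling of the real implementation are omitted; this is a sketch of the correction step only, not the library code.

```python
import math
from fractions import Fraction

def fma_emulated(a, b, c):
    """Correctly rounded a*b + c via exact rational arithmetic.
    float(Fraction) rounds to nearest-even, matching a true fma."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def hypot_corrected(ax, ay):
    """Sketch of the corrected branch; assumes ax >= ay > 0 after
    the usual special-case handling and scaling."""
    h = math.sqrt(fma_emulated(ax, ax, ay * ay))
    hsquared = h * h
    axsquared = ax * ax
    # Recover the low-order error terms of each squaring, then apply
    # a first-order Newton-style correction to h.
    h -= (fma_emulated(-ay, ay, hsquared - axsquared)
          + fma_emulated(h, h, -hsquared)
          - fma_emulated(ax, ax, -axsquared)) / (2 * h)
    return h
```

For a Pythagorean triple the correction terms all vanish and the result is exact, e.g. `hypot_corrected(4.0, 3.0)` returns `5.0`.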
Contributor

Could also just use muladd here, since it will use fma internally if Base.Math.FMA_NATIVE is true.

Member

Explicitly calling fma here seems better: the branch is actually correct with fma, regardless of whether we're running on an FMA-native system or not, it's just that the performance will be bad on a non-FMA-native system.

Contributor Author

Particularly true in light of the fact that in the work leading up to my original merge on hypot we discovered that not all muladds were being converted to fma. That is why I explicitly used fma here.

delta = h-ay
h -= muladd(delta,delta-2*(ax-ay),ax*(2*delta - ax))/(2*h)
if Base.Math.FMA_NATIVE
hsquared = h*h
Contributor

Maybe add a comment here explaining the reason for this branch.

# correctly rounded variant when hardware fma is available, preserves performance or similar...

Contributor Author

Excellent suggestion. I will make an appropriate change
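The delta-based rewrite quoted in this thread relies on an algebraic identity: with delta = h - ay, the residual h^2 - ax^2 - ay^2 equals delta*(delta - 2*(ax - ay)) + ax*(2*delta - ax). A quick exact-arithmetic check of that identity (this check is mine, not part of the PR; exact rationals are used so any mismatch would be algebraic rather than rounding):

```python
from fractions import Fraction

def residual_direct(h, ax, ay):
    # The quantity the correction step needs: h^2 - ax^2 - ay^2
    return h * h - ax * ax - ay * ay

def residual_delta(h, ax, ay):
    # The regrouped form used in the muladd-based variant
    delta = h - ay
    return delta * (delta - 2 * (ax - ay)) + ax * (2 * delta - ax)

# Exact rationals: equality here confirms the identity itself
cases = [(5, 4, 3), (13, 12, 5), (Fraction(7, 2), 3, Fraction(1, 2))]
for h, ax, ay in cases:
    assert residual_direct(h, ax, ay) == residual_delta(h, ax, ay)
```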

@cfborges (Contributor Author)

And thanks to all for the various bits of help on GitHub use. I just had everything messed up with our network file system and there was ensuing confusion over what was where. Turns out the command I really needed to run was:

git force pull head<ass

@cfborges (Contributor Author)

@musm @StefanKarpinski @simonbyrne
My one remaining concern is that Base.Math.FMA_NATIVE is not type dependent and only appears to check whether there is a native FMA for Float64. So in the case of, say, Float32 arguments to hypot, this new branch might fall back to a software FMA emulation. Would it be safer to make the conditional

if Base.Math.FMA_NATIVE & T==Float64

Or should Base.Math.FMA_NATIVE be changed to be type specific?

@simonbyrne (Contributor)

In practice they will be the same, but you could define something like:

isfmanative(::Type{T}) where {T<:AbstractFloat} = muladd(nextfloat(one(T)),nextfloat(one(T)),-nextfloat(one(T),2)) != 0

@cfborges (Contributor Author)

> In practice they will be the same, but you could define something like:
>
> isfmanative(::Type{T}) where {T<:AbstractFloat} = muladd(nextfloat(one(T)),nextfloat(one(T)),-nextfloat(one(T),2)) != 0

If you're not worried about it, then I'm not worried about it. I'll just leave it as is since on my machine which does have a native FMA if I put Float32s through they go through that branch and still come out as Float32.
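The probe suggested above works because (1+ε)² = 1 + 2ε + ε²: an unfused multiply rounds the ε² term away before the subtraction, while a fused multiply-add keeps it through a single rounding. A Python illustration of the same arithmetic (my own, using exact rational arithmetic as a stand-in for a true fma; `math.ulp(1.0)` plays the role of the ε between 1 and nextfloat(1)):

```python
import math
from fractions import Fraction

eps = math.ulp(1.0)        # spacing between 1.0 and the next float up
a = 1.0 + eps              # plays the role of nextfloat(one(T))
b = 1.0 + 2.0 * eps        # plays the role of nextfloat(one(T), 2)

# Unfused: a*a rounds to 1 + 2*eps first, so the subtraction gives 0.0
unfused = a * a - b

# Fused (emulated exactly): the eps^2 term survives the single rounding
fused = float(Fraction(a) * Fraction(a) - Fraction(b))

print(unfused, fused)
```

So the probe returns a nonzero value exactly when multiply-add is performed fused, which is what the `!= 0` test detects.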

@musm (Contributor)

musm commented Jun 27, 2019

merge?

@StefanKarpinski (Member)

@simonbyrne, if it looks good to you, would you merge?

@simonbyrne simonbyrne merged commit 710e43d into JuliaLang:master Jul 1, 2019
@cfborges (Contributor Author)

cfborges commented Jul 2, 2019

Cool. Two things.

First the original version of hypot is still there. I'll let someone more experienced than myself remove it since I have previously demonstrated the ability to mess that up.

Second, as far as I can tell from all my research, Julia is now the first language to support a correctly rounded hypot, even though this has been a recommended correctly rounded function in the IEEE 754 standard since 1985.

@cfborges cfborges deleted the hypotcr branch July 2, 2019 15:41
@StefanKarpinski (Member)

Sweet!

@KristofferC (Member)

Just pointing out that the correctly rounded hypot code is in practice never used, because the Base.Math.FMA_NATIVE test evaluates to false on the machine where Julia is built, which need not reflect the native capabilities of the machine where it runs (see #33011).

@expikr

expikr commented Mar 31, 2024

@cfborges I'm currently implementing this algorithm in the Zig std and I have a few questions:

  1. In the zip provided in the Supplemental Material section of the published version (https://dl.acm.org/doi/10.1145/3428446), the code is quite different from the version enclosed in the paper's appendix. Which is the more up-to-date version? The version inside the zip has an additional correction term; I'm assuming this is because that version did not use a fused muladd inside the sqrt?
  2. Are there any significant numerical reasons for applying the correction with -= rather than returning hypot - correction as an expression? Is it specific to Julia?

@oscardssmith (Member)

`a -= b` in Julia is just short for `a = a - b`. I'm not sure about your first point.

@cfborges (Contributor Author)

As to item #2, the comment by oscardssmith is correct.

As to item #1: the code in the Julia library is the corrected (fused) algorithm from the TOMS paper. It is correctly rounded for all single-precision arguments but fails in one known case for double precision (see section 6 of the paper for that case, or reference 11 in the paper for a compilation of the cases that are known to be hardest to round). That singular case could be patched with a single exception for double, but then there could be other problems in other notional precisions. The final algorithm in the TOMS paper corrects that; it guarantees a correctly rounded answer in any precision for a binary floating-point system that is IEEE compliant. I did not add that to the Julia library because it adds a substantial cost to correct a single error of almost no consequence, and there were already concerns about the cost of this algorithm.

Note that, as is made clear in the paper, the corrected (fused) algorithm can only fail to give a correctly rounded answer when the true answer lies almost exactly at the midpoint between two representable values (see reference 11 for a collection of these). It is ALWAYS faithful.

@expikr

expikr commented Apr 1, 2024

Thanks, I actually just meant to ask whether there were any numerical-precision reasons or Julia-specific optimizations for mutating the variable rather than doing an early return with the complete expression. I guess it's just coding style/preference?

So to confirm, whatever is in the Julia repository is the most up-to-date version of the first-order corrected fma algorithm?

For reference, this is what's in the zip:

function MyHypot4(x::T,y::T) where T<:AbstractFloat  # Corrected (Fused)

    ... # the inf/nan checks and prescaling omitted for brevity

    # Compute the first order terms
    x_sq = x*x
    y_sq = y*y
    sigma = x_sq+y_sq
    h = sqrt(sigma)

    # Compute tau using higher order terms and apply the correction
    sigma_e = y_sq - (sigma-x_sq)
    tau = fma(y,y,-y_sq) + fma(x,x,-x_sq) + sigma_e + fma(-h,h,sigma)
    h = fma(tau/h,.5,h)
    h*scale
end

The main differences are the extra sigma_e correction, the lack of fma in computing the square sum, and the use of fma in the error subtraction.

This is how I'm writing the latter part in zig:

inline fn hypotFused(comptime F: type, x: F, y: F) F {
    const r = @sqrt(@mulAdd(F, x, x, y * y));
    const rr = r * r;
    const xx = x * x;
    const dd = @mulAdd(F, -y, y, rr - xx) + @mulAdd(F, r, r, -rr) - @mulAdd(F, x, x, -xx);
    return r - dd / (2 * r);
}

@StefanKarpinski (Member)

Does your @mulAdd operation do fused multiply add?

@cfborges (Contributor Author)

cfborges commented Apr 1, 2024

The structure of my original code is critical. Rounding needs to happen at certain points and in precise ways. I use assignments to make that clear. Any intermediate results done in a temporary extended precision would invalidate the assumptions that lead to an accurate result. And the use of a correctly rounded fma is absolutely critical. This can be faked on machines that do not provide a hardware fma but it must behave the same way (i.e. it must yield the correctly rounded value of a*b+c.)
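The point about rounding happening at precise places can be made concrete with the error-free transformation that underlies each fma term in the correction: fma(x, x, -x_sq) recovers exactly the rounding error of the squaring, so x² == x_sq + err with no information lost. A Python sketch of that single building block (my own illustration, again emulating a correctly rounded fma with exact rationals):

```python
import math
from fractions import Fraction

def split_square(x):
    """Error-free transformation of squaring: returns (hi, lo) with
    hi = fl(x*x) and lo = fma(x, x, -hi), so hi + lo == x*x exactly
    (absent underflow)."""
    hi = x * x
    lo = float(Fraction(x) * Fraction(x) - Fraction(hi))  # emulated fma(x, x, -hi)
    return hi, lo

x = 1.0 + math.ulp(1.0)
hi, lo = split_square(x)
# hi alone has lost the low-order bits; hi + lo reproduces x*x exactly
assert Fraction(hi) + Fraction(lo) == Fraction(x) * Fraction(x)
```

Any extended-precision intermediate, or a non-correctly-rounded fma, would break the guarantee that `lo` is exactly the squaring error, which is the property the correction terms depend on.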

@expikr

expikr commented Apr 2, 2024

Using Godbolt, it seems that identical assembly is generated whether the in-place subtraction is used or the subtraction is simply returned as an expression, both for Julia and for Zig.

Julia Godbolt

Zig Godbolt

Compared below are the generated assembly following the current Julia implementation:

Julia Asm
julia_hypot_236:                        # @julia_hypot_236
        push    rbp
        vmulsd  xmm2, xmm0, xmm0
        vmovapd xmm3, xmm1
        mov     rbp, rsp
        vfmadd213sd     xmm3, xmm1, xmm2        # xmm3 = (xmm1 * xmm3) + xmm2
        vsqrtsd xmm3, xmm3, xmm3
        vfmsub213sd     xmm0, xmm0, xmm2        # xmm0 = (xmm0 * xmm0) - xmm2
        vmulsd  xmm4, xmm3, xmm3
        vsubsd  xmm5, xmm4, xmm2
        vfmsub231sd     xmm4, xmm3, xmm3        # xmm4 = (xmm3 * xmm3) - xmm4
        vaddsd  xmm2, xmm3, xmm3
        vfnmadd231sd    xmm5, xmm1, xmm1        # xmm5 = -(xmm1 * xmm1) + xmm5
        vaddsd  xmm1, xmm4, xmm5
        vsubsd  xmm0, xmm1, xmm0
        vdivsd  xmm0, xmm0, xmm2
        vsubsd  xmm0, xmm3, xmm0
        pop     rbp
        ret
Zig Asm
zig_hypot:
        vmulsd  xmm2, xmm1, xmm1
        vfmadd231sd     xmm2, xmm0, xmm0
        vsqrtsd xmm2, xmm2, xmm2
        vmulsd  xmm3, xmm2, xmm2
        vmulsd  xmm4, xmm0, xmm0
        vsubsd  xmm5, xmm3, xmm4
        vfnmadd231sd    xmm5, xmm1, xmm1
        vfmsub231sd     xmm3, xmm2, xmm2
        vaddsd  xmm1, xmm3, xmm5
        vfmsub213sd     xmm0, xmm0, xmm4
        vsubsd  xmm0, xmm1, xmm0
        vaddsd  xmm1, xmm2, xmm2
        vdivsd  xmm0, xmm0, xmm1
        vsubsd  xmm0, xmm2, xmm0
        ret

> And the use of a correctly rounded fma is absolutely critical. This can be faked on machines that do not provide a hardware fma but it must behave the same way (i.e. it must yield the correctly rounded value of a*b+c.)

Zig emulates fma on unsupported hardware so it would be a compiler bug if the rounding isn't identical to hardware fma, so I think that should be fine there?

> The structure of my original code is critical. Rounding needs to happen at certain points and in precise ways. I use assignments to make that clear. Any intermediate results done in a temporary extended precision would invalidate the assumptions that lead to an accurate result.

So if I'm understanding correctly, the TOMS version of MyHypot4 differs from Julia's current implementation because it corrects relative to the pre-sqrt sum (sigma) rather than the squared post-sqrt result (h*h) used as h_squared? Regardless, the current implementation's correction is matched to its own rounding steps, I trust?

@StefanKarpinski (Member)

In-place subtraction is merely syntax in Julia, so that's unsurprising.

> Zig emulates fma on unsupported hardware so it would be a compiler bug if the rounding isn't identical to hardware fma, so I think that should be fine there?

This is why I was asking about what @mulAdd semantically means. To be clear, in Julia, fma does mandatory fused multiply add, emulated if necessary, while muladd does whichever is faster: fma or mul and then add.

@cfborges (Contributor Author)

cfborges commented Apr 2, 2024 via email
