Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make abs, abs_imag, inv, and / resistant to under/overflow #122

Merged
merged 24 commits into from
Jan 9, 2023
Merged

Conversation

sethaxen
Copy link
Collaborator

@sethaxen sethaxen commented Dec 10, 2022

Currently abs and abs_imag effectively use abs2 under the hood. For very small numbers, this causes underflow, and for very large numbers, this causes overflow. In base. abs(z::Complex) uses hypot, which takes care to avoid under/overflow. This PR uses hypot in abs and abs_imag.

Before Julia 1.9, hypot for more than 3 args was unbearably slow (JuliaLang/julia#44336), so abs is only changed for 1.9 and later.

There is a performance cost. On v1.9-alpha1:

julia> using Quaternions, BenchmarkTools

julia> q = randn(QuaternionF64);

julia> q = quat(1.3, 5.6, -3.3, 0.9);

julia> q2 = quat(fill(1e-300, 4)...);

On main

julia> @btime abs($q);
  1.927 ns (0 allocations: 0 bytes)

julia> @btime $(Quaternions.abs_imag)($q);
  1.922 ns (0 allocations: 0 bytes)

julia> @btime abs($q2);
  1.922 ns (0 allocations: 0 bytes)

julia> @btime $(Quaternions.abs_imag)($q2);
  1.922 ns (0 allocations: 0 bytes)

this PR

julia> @btime abs($q);
  8.471 ns (0 allocations: 0 bytes)

julia> @btime $(Quaternions.abs_imag)($q);
  25.249 ns (0 allocations: 0 bytes)

julia> @btime abs($q2);
  8.859 ns (0 allocations: 0 bytes)

julia> @btime $(Quaternions.abs_imag)($q2);
  25.240 ns (0 allocations: 0 bytes)

I have no idea why after this PR abs_imag is more expensive than abs, when it does fewer operations.

julia> @code_typed Quaternions.abs(q)
CodeInfo(
1%1 = Base.getfield(q, :s)::Float64%2 = Base.getfield(q, :v1)::Float64%3 = Base.getfield(q, :v2)::Float64%4 = Base.getfield(q, :v3)::Float64%5 = Core.tuple(%1, %2, %3, %4)::NTuple{4, Float64}%6 = invoke Base.Math._hypot(%5::NTuple{4, Float64})::Float64
└──      return %6
) => Float64

julia> @code_typed Quaternions.abs_imag(q)
CodeInfo(
1%1 = Base.getfield(q, :v1)::Float64%2 = Base.getfield(q, :v2)::Float64%3 = Base.getfield(q, :v3)::Float64%4 = Core.tuple(%1, %2, %3)::Tuple{Float64, Float64, Float64}%5 = invoke Base.Math._hypot(%4::Tuple{Float64, Float64, Float64})::Float64
└──      return %5
) => Float64

@codecov
Copy link

codecov bot commented Dec 10, 2022

Codecov Report

Merging #122 (347c880) into main (7e0efc6) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #122   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            2         2           
  Lines          137       164   +27     
=========================================
+ Hits           137       164   +27     
Impacted Files Coverage Δ
src/Quaternion.jl 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@stevengj
Copy link
Contributor

I would argue that you should prioritize correctness over performance.

It's better if people using older versions should still get correct results, even if it is slower (abs is not often performance-critical anyway?), than that users of different Julia versions should get completely different answers.

@stevengj
Copy link
Contributor

stevengj commented Dec 15, 2022

If the performance is a real worry, you can always backport the Base code to earlier versions:

if VERSION < v"1.9" # backport code from julia#44357
    function _hypot(x::NTuple{N,<:Number}) where {N}
        maxabs = maximum(abs, x)
        if isnan(maxabs) && any(isinf, x)
            return typeof(maxabs)(Inf)
        elseif (iszero(maxabs) || isinf(maxabs))
            return maxabs
        else
            return maxabs * sqrt(sum(y -> abs2(y / maxabs), x))
        end
     end
     Base.abs(q::Quaternion) = _hypot((real(q), imag_part(q)...))
end

@sethaxen
Copy link
Collaborator Author

I would argue that you should prioritize correctness over performance.

Agreed!

abs is not often performance-critical anyway?

One of the main uses of this package is unit quaternions for rotations, which are normalized using abs, so this is an operation we'd like to keep performant if possible.

If the performance is a real worry, you can always backport the Base code to earlier versions:

Good idea! I'll adapt.

@stevengj can you think of a reason why this would be the case? 👇

I have no idea why after this PR abs_imag is more expensive than abs, when it does fewer operations.

@stevengj
Copy link
Contributor

stevengj commented Dec 16, 2022

You might need to suppress compiler inlining and constant propagation with

@btime abs($(Ref(q))[]);

as explained in the BenchmarkTools manual.

@sethaxen
Copy link
Collaborator Author

You might need to suppress compiler inlining and constant propagation with

@btime abs($(Ref(q))[]);

as explained in the BenchmarkTools manual.

Doesn't seem to make a difference:

julia> using Quaternions, BenchmarkTools

julia> q = randn(QuaternionF64)
QuaternionF64(-0.06595820195841512, 0.2980891652137365, 0.1349996383131589, -0.14878742850114615)

julia> @btime abs($(Ref(q))[]);
  8.439 ns (0 allocations: 0 bytes)

julia> @btime $(Quaternions.abs_imag)($(Ref(q))[]);
  25.225 ns (0 allocations: 0 bytes)

@stevengj
Copy link
Contributor

If you can reproduce it just by hypot(3.,4.,5.,6.) vs hypot(3.,4.,5.), it might be worth filing an issue.

@mikmoore
Copy link
Contributor

mikmoore commented Jan 3, 2023

If performance is a concern, we could do things like compute abs2 and use that if it's floatmin/eps < abs2(q) < floatmax, only falling back to hypot when this fails (but that check also adds a cost). But it'd be nicer if we didn't have this special casing here and simply could call hypot always.

I went wild trying to optimize Base.Math._hypot on IEEEFloat inputs. I pulled out all the stops with bit-twiddling and this is the best I came up with:

__maxabs_pos(x::NTuple{N,T}) where {N,T<:Base.IEEEFloat} = reinterpret(T, maximum(z -> reinterpret(Signed, z), x)) # specialization of maximum(x) for positive-signed x
__prevpow2(x::T) where T<:Base.IEEEFloat = reinterpret(T, reinterpret(Unsigned, x) & ~Base.significand_mask(T)) # truncate to a power of 2 # DOES NOT WORK FOR SUBNORMALS
__invpow2(x::T) where T<:Base.IEEEFloat = reinterpret(T, reinterpret(Unsigned, __prevpow2(floatmax(T))) - reinterpret(Unsigned, x)) # invert a power of 2 # DOES NOT WORK FOR NEGATIVES, SUBNORMALS, OR VERY LARGE VALUES

function myhypot(x::NTuple{N,T}) where {N,T<:Base.IEEEFloat}
	__min(a, b) = ifelse(a < b, a, b)
	__max(a, b) = ifelse(a > b, a, b)
	x = map(abs, x)
	any(==(convert(T, Inf)), x) && return convert(T, Inf)
	maxabs = __maxabs_pos(x) # == maximum(abs, x)
	m = __max(floatmin(T), __min(inv(floatmin(T)), __prevpow2(maxabs))) # something roughly the size of maxabs that is a power of 2
	mi = __invpow2(m) # marginally faster than inv
	return m * sqrt(sum(y -> abs2(y * mi), x))
end

I think I got all the zero/Inf/NaN behavior correct but I haven't run a full test suite. Compared to the version provided above, this brought the runtime down from 8.2->6.4ns for 3-tuples, 8.4->6.5ns for 4-tuples and 41.3->17.6ns for 16-tuples. Meanwhile, sqrt(sum(abs2,x)) is more like 3.0ns for 3/4-tuples and 5.8ns for 16-tuples.

This may not be closing the gap as much as we'd hope, but it's maybe the fastest we can do this robustly. Really, parts of this implementation may be too involved for actual use -- I was just trying to get a sense for how fast it could be done.

@sethaxen
Copy link
Collaborator Author

sethaxen commented Jan 7, 2023

Thanks @mikmoore for looking into this! I agree it is much better if this is solved upstream in hypot. I opened an issue in JuliaLang/julia#47917 and so far no one seems to have found the culprit.

abs_imag being slow only impacts the extensions of the analytic functions. The ones most likely to be called are exp and log, which are ~3x slower with this PR.

I suggest either we merge this PR as a whole, asserting that correctness is more important than speed (and hoping the issue is eventually resolved upstream) or we at least merge the abs part and perhaps save abs_imag for later. Thoughts?

@mikmoore
Copy link
Contributor

mikmoore commented Jan 7, 2023

The simplest thing (and my thought) is to use hypot and wait for the upstream fixes to resolve the performance. But I have never used analytic functions of quaternions so I'm not suffering the performance penalty here. My anticipation is that the majority of users are also in that situation. Of the ones that do use them, most probably haven't and won't notice the performance difference as I don't think these often end up in hot loops.

For what it's worth, inv and / have the same over/underflow hazard and we've yet to fix them or even open an issue. Those are more likely to be encountered than issues with analytic functions, I'd venture.

I opened JuliaLang/julia#48130, which should hopefully improve hypot performance a little more yet. It appears to be about 1ns slower than what I posted above but is marginally less prone to overflow and much easier to read/maintain. In that PR, I failed to reproduce JuliaLang/julia#47917 so it seems that someone has already fixed it for v1.10.

@sethaxen sethaxen changed the title Make abs and abs_imag resistant to under/overflow Make abs, abs_imag, inv, and / resistant to under/overflow Jan 8, 2023
@sethaxen
Copy link
Collaborator Author

sethaxen commented Jan 8, 2023

Since we would need to replicate _hypot anyways, I opted to just not use hypot and instead roll our own function that mimics its logic but avoids working with the NTuple. This seems to be faster (see above and below benchmarks). Doing the same for abs_imag, the slowdown we saw with hypot completely disappears. Now these are fast for all Julia versions.

For what it's worth, inv and / have the same over/underflow hazard and we've yet to fix them or even open an issue.

I expanded this PR to also handle under/overflow for inv and /. The implementations follow the corresponding ones for Complex. There are more improvements we could do for QuaternionF64 and Quaternion32, but these generic ones cover most cases already.

Here are the new benchmarks:

On v1.9-beta2:

julia> using Quaternions, BenchmarkTools, Random

julia> Random.seed!(0);

julia> q1, q2 = randn(QuaternionF64, 2);

Release:

julia> using Quaternions, BenchmarkTools, Random

julia> Random.seed!(0);

julia> q1, q2 = randn(QuaternionF64, 2);

julia> @btime abs($q1);
  1.683 ns (0 allocations: 0 bytes)

julia> @btime $(Quaternions.abs_imag)($q1);
  1.918 ns (0 allocations: 0 bytes)

julia> @btime inv($q1);
  3.132 ns (0 allocations: 0 bytes)

julia> @btime $q1 / $q2;
  5.003 ns (0 allocations: 0 bytes)

This PR:

julia> @btime abs($q1);
  6.979 ns (0 allocations: 0 bytes)

julia> @btime $(Quaternions.abs_imag)($q1);
  5.346 ns (0 allocations: 0 bytes)

julia> @btime inv($q1);
  9.164 ns (0 allocations: 0 bytes)

julia> @btime $q1 / $q2;
  15.030 ns (0 allocations: 0 bytes)

Here I dicuss a speed-up that I've opted not to use. We can get about a 2-fold speed-up by normalizing sum(abs, ...) instead of max(abs, ...), as suggested in https://doi.org/10.1109/ARITH48897.2020.00016.

julia> @btime abs($q1);
  3.807 ns (0 allocations: 0 bytes)

julia> @btime $(Quaternions.abs_imag)($q1);
  3.600 ns (0 allocations: 0 bytes)

julia> @btime inv($q1);
  6.662 ns (0 allocations: 0 bytes)

julia> @btime $q1 / $q2;
  7.638 ns (0 allocations: 0 bytes)

With this choice, the tests included in this PR still pass. The trade-off is that we will overflow for cases where the sum of elements exceeds the maximum floating point number when none of the individual elements do.

e.g. with max

julia> abs(quat(1e308, 1e308, 0, 0))
1.4142135623730951e308

with sum

julia> abs(quat(1e308, 1e308, 0, 0))
Inf

@sethaxen sethaxen requested review from hyrodium and removed request for hyrodium January 8, 2023 23:58
Copy link
Collaborator

@hyrodium hyrodium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the implementation without hypot, and most of the changes look good to me!
I added some minor suggestions for docstring and test code.

test/Quaternion.jl Outdated Show resolved Hide resolved
test/Quaternion.jl Outdated Show resolved Hide resolved
src/Quaternion.jl Show resolved Hide resolved
sethaxen and others added 2 commits January 9, 2023 09:24
Co-authored-by: Yuto Horikawa <hyrodium@gmail.com>
Co-authored-by: Yuto Horikawa <hyrodium@gmail.com>
test/Quaternion.jl Outdated Show resolved Hide resolved
Co-authored-by: Yuto Horikawa <hyrodium@gmail.com>
Copy link
Collaborator

@hyrodium hyrodium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants