
Use Levin transform in Complex airy functions (again) #95

Merged
19 commits merged from airy_levin2 into JuliaMath:master on Apr 28, 2023

Conversation

heltonmc
Member

Supersedes #94 which includes the rebase.

@codecov

codecov bot commented Apr 24, 2023

Codecov Report

Patch coverage: 96.91% and project coverage change: +0.53% 🎉

Comparison is base (cec8fce) 96.09% compared to head (ab50225) 96.63%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #95      +/-   ##
==========================================
+ Coverage   96.09%   96.63%   +0.53%     
==========================================
  Files          20       23       +3     
  Lines        2228     2377     +149     
==========================================
+ Hits         2141     2297     +156     
+ Misses         87       80       -7     
Impacted Files Coverage Δ
src/AiryFunctions/airy.jl 99.56% <ø> (ø)
src/BesselFunctions/Float128/besselj.jl 98.07% <ø> (ø)
src/BesselFunctions/Float128/bessely.jl 96.07% <ø> (ø)
src/BesselFunctions/Polynomials/besselj_polys.jl 100.00% <ø> (ø)
src/BesselFunctions/U_polynomials.jl 99.28% <ø> (ø)
src/BesselFunctions/asymptotics.jl 99.48% <ø> (ø)
src/BesselFunctions/besseli.jl 99.00% <ø> (ø)
src/BesselFunctions/besselj.jl 98.31% <ø> (ø)
src/BesselFunctions/bessely.jl 97.95% <ø> (ø)
src/BesselFunctions/constants.jl 100.00% <ø> (ø)
... and 13 more


(6 review comments on src/Airy/cairy.jl, now outdated/resolved)
@heltonmc
Member Author

I still need to do a little bit of work on this, but I've added the Levin transform for the airybi functions. This function is more difficult than airyai (which is difficult in itself) because its asymptotic expansion is a combination of two series.

I made the initial mistake of trying to combine the two series within the transform, which led to errors. The transform must instead be performed on each series individually and the results combined after the transformation. It was subtle at first why this is the case, but it is necessary. The alternative is to treat them as individual terms within one series (i.e., not combine them), but that requires essentially double the number of terms.

In the case of the Levin transform, it is more efficient to compute two series of length N/2 than one series of length N, because the amount of computation scales non-linearly with the series length. The function itself is somewhat hacked together right now, but the initial computation of the partial sums of the two series should be able to happen simultaneously. I'll work on that in the next few days after I take a little break from looking at it.
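For intuition, here is a plain scalar sketch of a Levin u-transform applied to a single series. The `levin_u` helper below is illustrative only (Bessels.jl's `levin_transform` is a SIMD implementation operating on packed partial sums); for airybi, such a transform would be applied to each of the two asymptotic series separately and the results combined afterward:

```julia
# Illustrative scalar Levin u-transform: accelerate the series whose terms
# are `a[1], a[2], …` using the standard u-variant remainder estimates
# ω_j = (j + 1) * a_j.  Not Bessels.jl's implementation.
function levin_u(a::AbstractVector{T}) where T
    k = length(a) - 1                    # transform order (0-based)
    s = cumsum(a)                        # partial sums s_0 … s_k
    num = zero(float(T))
    den = zero(float(T))
    for j in 0:k
        ω = (j + 1) * a[j + 1]           # u-transform remainder estimate
        c = (-1)^j * binomial(k, j) * ((j + 1) / (k + 1))^(k - 1)
        num += c * s[j + 1] / ω          # weighted partial sums
        den += c / ω                     # matching normalization
    end
    return num / den
end
```

Applied to, say, the alternating harmonic series, ten terms recover log(2) far more accurately than the raw partial sum does.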

The advantages are now threefold: 1. these are much more accurate than the previous versions, 2. they are much faster, and 3. they completely separate the airy function computation from the bessel function routines (so we can cleanly split it into its own module).

@heltonmc
Member Author

There are a couple of things to figure out as well. I still need to work on a Float32 routine and get that ironed out. We should only need the power series and the asymptotic expansion there. I think it might be confusing to readers if we have a primary routine that dispatches on the cutoff parameters but some of the dispatches are never hit by the Float32 routine, since the first two branches cover all the cases. It is also the case right now that it was easier to design the cutoff for the Levin case before the power series case, but that could be switched.

And I need to figure out what to do with large negative values. These will still suffer from the same issues on the real line.

I personally don't like throwing an error. I would prefer returning NaN (like SciPy) or extensively documenting the expected errors and returning a value with a loss of precision (like the Boost Math Library). It seems unnatural to throw an error for large negative arguments but return a value for -Inf. Throwing also hurts the purity of these functions and the ability to statically compile them with StaticCompiler (my primary interest in some of these routines)...

(1 review comment on src/Airy/cairy.jl, now outdated/resolved)
@heltonmc
Member Author

heltonmc commented Apr 26, 2023

Still haven't figured out a good way to combine the above. But I noticed that a large part of the runtime is actually in generating the partial sums, while the Levin transform itself (levin_transform(l)) takes only about 10% of the runtime, at least for small to medium lengths (N < 22). Its cost of course grows non-linearly, but we are only worried about around N = 16.

So it may be more fruitful to focus on optimizing the sequence generation. The issue there is the complex divide (which I've reduced to a single @fastmath inv), but even that is slow. It is possible, though, to rearrange the formula to avoid the complex division at the cost of a complex-real multiply and a complex-complex multiply. In general, complex divides are ridiculously slow (I don't have an exact figure for the equivalent number of complex multiplies), so they should be avoided at all costs.
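To make the rearrangement concrete, here is a minimal standalone sketch of the dual-accumulator idea (the helper name and form are mine; the ratio factor mirrors the airyaiprime expansion):

```julia
# Hypothetical helper, not part of the PR: maintain a series term `t` and
# its reciprocal `t2` with multiplies only, so the Levin weights s*t2
# (= s/ω) and t2 (= 1/ω) never require a complex division in the loop.
function dual_accumulators(z::Complex, N::Int)
    a  = inv(z)        # the single complex division, hoisted out of the loop
    a2 = z             # a2 == inv(a)
    t  = one(z)        # running series term
    t2 = one(z)        # running reciprocal of that term
    for i in 1:N
        c = 3 * (i - 7//6) * (i + 1//6) / i   # real (rational) ratio factor
        t  *= -a * c                          # complex-real multiply
        t2 *= -a2 * inv(c)                    # complex-real multiply, no complex divide
    end
    return t, t2
end
```

The invariant `t * t2 ≈ 1` holds at every step. In the PR's sequence generation, the two accumulators are additionally packed into `Vec{4, T}` lanes for `levin_transform`.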

In any case... it is opportune to write the sequence generation as...

@generated function airyaiprimex2_levin(x::Complex{T}, ::Val{N}) where {T <: Union{Float32, Float64}, N}
    :(
        begin
            xsqr = sqrt(x)
            xsqrx = xsqr * x
            t = -GAMMA_ONE_SIXTH(T) * GAMMA_FIVE_SIXTHS(T) / 4
            t2 = 1 / t
            a = @fastmath inv(4*xsqrx)   # the only complex division
            a2 = 4 * xsqrx

            s = zero(typeof(x))
            l = @ntuple $N i -> begin
                s += t                                               # partial sum
                t *= -a * (3 * (i - 7//6) * (i + 1//6) / i)          # next series term
                t2 *= -a2 * (i / (3 * (i - 7//6) * (i + 1//6)))      # reciprocal of next term
                Vec{4, T}((reim(s * t2)..., reim(t2)...))            # s/ω and 1/ω for the Levin weights
            end
            return levin_transform(l) * sqrt(xsqr) / T(π)^(3//2)
        end
    )
end

So we now have two accumulators that keep track of the series terms and the reciprocal series terms. Here are the benchmarks.

julia> @benchmark Bessels.airyaiprimex2_levin(z, Val(16)) setup=(z=rand() + rand()*im)
BenchmarkTools.Trial: 10000 samples with 932 evaluations.
 Range (min … max):  109.218 ns … 164.520 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     109.844 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   110.603 ns ±   3.260 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅█▁      ▄▃▂                                                ▂
  ██████▆▆▆▆████▆▆▆▇▇▇▃▃▁▃▅▆▅▆▆▄▆▃▅▆▅▃▄▃▅▄▅▅▆▄▃▄▅▆▅▄▅▅▆▅▆▅▅▆▄▆▆ █
  109 ns        Histogram: log(frequency) by time        127 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Bessels.airyaiprimex_levin(z, Val(16)) setup=(z=rand() + rand()*im)
BenchmarkTools.Trial: 10000 samples with 839 evaluations.
 Range (min … max):  145.262 ns … 209.377 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     149.484 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.999 ns ±   4.852 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▇▆     ▃▂ █▂     ▃▂                                          ▁
  ███▆▆▇▅▅██████▇▇▆▆██▅▆▆▇▆▄▄▅▅▄▅▄▅▃▅▅▃▃▅▄▄▄▄▅▄▂▅▅▆▅▄▆▆▄▅▄▅▃▄▄▅ █
  145 ns        Histogram: log(frequency) by time        172 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

That reduces the time significantly, and we also get to avoid any fastmath flags in the loop. Looking at the machine code, though, it is still not optimal: it should now be possible to vectorize the t and t2 accumulation. The auto-vectorizer is doing an OK job, but there is still scalar code being generated within this loop, so it might be possible to further reduce the computational time.

I worry a little bit that this is losing the meaning of the original formulas, making them a little more difficult to read. But........ 🤷‍♂️ it also looks like it will be possible to speed this up significantly, so it might be worth it. I just need to document extensively what's going on....

@cgeoga
Contributor

cgeoga commented Apr 26, 2023

Taking a look at the source of this PR as I play with the branch. Just as a sanity check: do you mean for besselkx_levin to live in the math.jl file?

@heltonmc
Member Author

> do you mean for besselkx_levin to live in the math.jl file?

No, I just used that as an initial reference. I've moved it and added tests for those functions to make sure they are working. I haven't made the optimizations yet, but they are still reasonably fast right now.

Since this PR is focused on the airy functions, I'm going to finish those off and merge; then I'll work more on the besselk case, but you should be able to use these. I removed any premature breaking changes from them!

@heltonmc
Member Author

heltonmc commented Apr 26, 2023

Alright, this finally completes the vision of having everything that doesn't depend on anything else in separate modules. If someone in the future wants to split these off into separate packages, that would be fine with me, but for now this is good.

Supersedes #84. For example, the AiryFunctions package now just depends on the Math submodule, which contains the math constants and the Levin transform and reexports what we need from SIMDMath.

I'm sure there are some things to sort out, depending on how we want to precompile some of the methods. Should precompile statements happen at the top-level module or within each submodule that is then reexported? Hoping to take full advantage of v1.9 coming soon.

Still have several things to sort out in the airy functions, though.

Unsure what's going on with the invalidation CI. I think it's because of the recent movements with SnoopCompile and PrecompileTools?
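For reference, the two placements being weighed can be sketched with PrecompileTools (module contents and the precompiled calls are illustrative only, not this package's actual layout):

```julia
# Option A: one workload at the top-level module, after submodules load.
module Bessels
    using PrecompileTools
    # include("AiryFunctions/AiryFunctions.jl"); using .AiryFunctions; …
    @compile_workload begin
        # airyai(1.0 + 1.0im)   # representative calls, illustrative
    end
end

# Option B: each submodule precompiles its own methods before reexport.
module AiryFunctions
    using PrecompileTools
    @compile_workload begin
        # airyai(1.0 + 1.0im)   # illustrative
    end
end
```

Option B keeps each submodule self-contained if they are ever split into packages; Option A precompiles only what the top-level module actually reexports.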

@heltonmc
Member Author

A little concerned that the testing time for the airy function module has increased from 3 s to 7 s, though locally it still takes around 3 s. I'm wondering if in CI there is a high cost of compilation that isn't being cached appropriately. I ran a few benchmarks locally and can't reproduce it, so I think that's OK.

@heltonmc
Member Author

Alright I think this is ready to merge now to make way for #96

@heltonmc heltonmc merged commit 152771d into JuliaMath:master Apr 28, 2023
@heltonmc heltonmc deleted the airy_levin2 branch April 28, 2023 21:15