Use Levin transform in Complex airy functions (again) #95
Codecov Report

Patch coverage:

```
@@            Coverage Diff             @@
##           master      #95      +/-  ##
==========================================
+ Coverage   96.09%   96.63%   +0.53%
==========================================
  Files          20       23       +3
  Lines        2228     2377     +149
==========================================
+ Hits         2141     2297     +156
+ Misses         87       80       -7
```

☔ View full report in Codecov by Sentry.
I still need a little bit of work on this, but I've added the Levin transform. I made the initial mistake of trying to combine the two series within the transform, which led to errors. The transform must instead be performed individually on the two series, and the results combined after the transformation. It was a bit subtle at first why this is the case, but it is necessary. Essentially, the alternative is to treat them as individual terms within a single series (i.e., not combine them), but that requires essentially double the number of terms. For the Levin transform it is more optimal to compute two separate series than one twice as long.

The advantages are now manifold: 1. these are much more accurate than the previous version; 2. they are much faster; 3. they completely separate the Airy function computation from the Bessel function routines (so we can split it cleanly into its own module).
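As a side note, the acceleration step itself can be illustrated in isolation. Below is a minimal sketch (in Python with made-up names, purely for illustration; it is not the package's `levin_transform`) of a Levin t-type transform applied to one alternating series, using the series terms themselves as remainder estimates:

```python
from math import comb, log

def levin_t(terms):
    """Levin t-type transform: accelerate sum(terms) from partial sums s_j
    using the terms themselves as remainder estimates (omega_j = terms[j])."""
    n = len(terms) - 1
    s, partial = 0.0, []
    for a in terms:
        s += a
        partial.append(s)
    num = den = 0.0
    for j in range(n + 1):
        # alternating binomial weights with the standard (j+1)^(n-1) damping
        w = (-1) ** j * comb(n, j) * (j + 1) ** (n - 1)
        num += w * partial[j] / terms[j]
        den += w / terms[j]
    return num / den

# Alternating harmonic series, which converges to log(2) very slowly:
terms = [(-1) ** (k + 1) / k for k in range(1, 11)]
print(abs(sum(terms) - log(2)))      # crude 10-term partial sum: error ~5e-2
print(abs(levin_t(terms) - log(2)))  # accelerated: error many orders smaller
```

This is the same idea as in the PR: the transform is a ratio of two weighted sums, one over `s_j / ω_j` and one over `1 / ω_j`, which is why the implementation packs both into each element of the sequence.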
There are a couple of things to figure out as well. I still need to work on a few remaining pieces, and I need to figure out what to do with large negative arguments: this will still suffer from the same issues on the real line. I don't really like throwing an error personally. I would prefer returning NaN (like SciPy) or extensively documenting the expected errors and return values with loss of precision (as the Boost Math library does). That seems more natural than throwing an error for some inputs but returning a value for -Inf. Throwing also hurts the purity of these functions and the ability to statically compile them with StaticCompiler (my primary interest for some of these routines)...
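To make the NaN-returning convention concrete, here is a tiny sketch (Python for illustration; the function name `airy_like` and the cutoff `-50` are assumptions, not the package's actual behavior). The point is that the function always returns a value of the same type, so no control flow depends on an exception being raised:

```python
import cmath
import math

def airy_like(z: complex) -> complex:
    """Hypothetical sketch: outside the supported domain, return a quiet NaN
    (SciPy-style) instead of raising, keeping the return type uniform."""
    if z.real < -50:  # assumed cutoff, purely for illustration
        return complex(math.nan, math.nan)
    return cmath.exp(-z)  # stand-in for the real computation

r = airy_like(complex(-100, 0))
print(math.isnan(r.real))  # the NaN quietly signals the domain issue
```

This pattern keeps the function type-stable and free of exception machinery, which is what matters for static compilation.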
Still haven't figured out a good way to combine the above. But I noticed that a large part of the runtime is actually in generating the partial sums, while the Levin transform itself is comparatively cheap. In that case it may be more optimal to focus on optimizing the sequence generation. The issue, of course, is having a complex divide (which I've confined to a single fastmath call).

In any case, it is opportune to write the sequence generation as:

```julia
@generated function airyaiprimex2_levin(x::Complex{T}, ::Val{N}) where {T <: Union{Float32, Float64}, N}
    :(
        begin
            xsqr = sqrt(x)
            xsqrx = xsqr * x

            t = -GAMMA_ONE_SIXTH(T) * GAMMA_FIVE_SIXTHS(T) / 4
            t2 = 1 / t
            a = @fastmath inv(4 * xsqrx)
            a2 = 4 * xsqrx
            s = zero(typeof(x))
            l = @ntuple $N i -> begin
                s += t
                t *= -a * (3 * (i - 7//6) * (i + 1//6) / i)
                t2 *= -a2 * (i / (3 * (i - 7//6) * (i + 1//6)))
                Vec{4, T}((reim(s * t2)..., reim(t2)...))
            end
            return levin_transform(l) * sqrt(xsqr) / T(π)^(3//2)
        end
    )
end
```

So we now have two accumulators that keep track of the series terms and the inverse series terms. Here are the benchmarks:

```julia
julia> @benchmark Bessels.airyaiprimex2_levin(z, Val(16)) setup=(z=rand() + rand()*im)
BenchmarkTools.Trial: 10000 samples with 932 evaluations.
 Range (min … max):  109.218 ns … 164.520 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     109.844 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   110.603 ns ±   3.260 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅█▁    ▄▃▂                                                  ▂
  ██████▆▆▆▆████▆▆▆▇▇▇▃▃▁▃▅▆▅▆▆▄▆▃▅▆▅▃▄▃▅▄▅▅▆▄▃▄▅▆▅▄▅▅▆▅▆▅▅▆▄▆▆ █
  109 ns        Histogram: log(frequency) by time        127 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Bessels.airyaiprimex_levin(z, Val(16)) setup=(z=rand() + rand()*im)
BenchmarkTools.Trial: 10000 samples with 839 evaluations.
 Range (min … max):  145.262 ns … 209.377 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     149.484 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.999 ns ±   4.852 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▇▆ ▃▂    █▂ ▃▂                                               ▁
  ███▆▆▇▅▅██████▇▇▆▆██▅▆▆▇▆▄▄▅▅▄▅▄▅▃▅▅▃▃▅▄▄▄▄▅▄▂▅▅▆▅▄▆▆▄▅▄▅▃▄▄▅ █
  145 ns        Histogram: log(frequency) by time        172 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

That reduces the time significantly, and we also get to avoid any fastmath flags in the loop. Looking at the machine code, though, it is still not optimal unfortunately. It should now be possible to vectorize the sequence generation as well.

I worry a little that this is losing the meaning of the original formulas, making them a little more difficult to read. But........ 🤷♂️ it also looks like it will be possible to speed this up significantly, so it might be worth it. I just need to look carefully at what's going on....
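The two-accumulator trick above can be checked in isolation. Because each update ratio applied to `t2` is the exact reciprocal of the ratio applied to `t` (the rational factor is inverted symbolically and `a2` is the pre-inverted counterpart of `a`), `t2` tracks `1/t` using only multiplications, so no division is needed inside the loop. A standalone check with exact rationals (Python for illustration; the names `ratio`/`inv_ratio` and the starting values are made up, but the rational factor mirrors the Airy series ratio from the code above):

```python
from fractions import Fraction

def ratio(i, a):
    # term ratio, mirroring: -a * (3 * (i - 7//6) * (i + 1//6) / i)
    return -a * (3 * (i - Fraction(7, 6)) * (i + Fraction(1, 6)) / i)

def inv_ratio(i, a2):
    # its exact reciprocal, mirroring: -a2 * (i / (3 * (i - 7//6) * (i + 1//6)))
    return -a2 * (i / (3 * (i - Fraction(7, 6)) * (i + Fraction(1, 6))))

a = Fraction(1, 8)   # stands in for inv(4 * x^(3/2)) at some fixed x
a2 = 1 / a           # the pre-inverted counterpart (4 * x^(3/2))
t, t2 = Fraction(3, 2), Fraction(2, 3)  # arbitrary t with t2 = 1/t
for i in range(1, 9):
    t *= ratio(i, a)
    t2 *= inv_ratio(i, a2)
    assert t * t2 == 1  # the reciprocal is maintained exactly at every step
print("t2 tracks 1/t using only multiplications")
```

In floating point the invariant `t * t2 ≈ 1` holds only to rounding error, but that is exactly the trade being made: one upfront reciprocal instead of a complex divide per iteration.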
Taking a look at the source of this PR as I play with the branch. Just as a sanity check: do you mean for...
No, I just used that as an initial reference. I've moved it and added tests for those functions just to make sure they are working. I haven't made the optimizations, but they are still reasonably fast right now. Since this PR is focused on Airy functions, I'm going to finish those off and merge; then I'll work more on the rest.
Alright, this finally completes the vision of having everything that doesn't depend on each other in separate modules. If someone in the future wants to split these off into separate packages, that would be fine with me I think, but for now this is good. Supersedes #84.

For example, the AiryFunctions package now depends only on the Math submodule, which contains the math constants, the Levin transform, and reexports of what we need from SIMDMath.

I'm sure there are some things to sort out, which I will do depending on how we want to precompile some of the methods. Should precompile statements happen at the top module level or within each submodule that is then reexported? Hoping to take full advantage of v1.9 coming soon. Still have several things to sort out in the Airy functions, though.

Unsure what's going on with the invalidation CI. I think it's because of the recent movements with SnoopCompile and PrecompileTools?
A little concerned that the testing time for the Airy function module has increased from 3s to 7s, though locally it still takes around 3s. I'm wondering if through CI there is a high cost of compilation that isn't being cached appropriately. I ran a few benchmarks locally and can't reproduce it, so I think that is OK.
Alright, I think this is ready to merge now, to make way for #96.
Supersedes #94 which includes the rebase.