A different approach to compute besselk in intermediate zones #74
Comments
@heltonmc, apologies---I somehow completely missed this when you posted it and am just seeing it now for the first time. I like this idea a lot. I tried to think about some similar ideas to leverage the large-order expansions for these intermediate argument ranges, but I never really got anywhere, so this is very nice. I didn't realize the power series cutoff was where it is. I think I honestly need a little more time to grok why this works at all, though. I never really understood the Miller recursion at a very deep level, and this whole concept of normalization is still a little shaky to me. But I'll definitely play with this and try to understand it, and it seems like a seriously exciting option.
Not a problem!! I was working on this before I needed to switch to dissertation stuff, so I haven't come back to it; hopefully I will work on it a little more in the next couple of weeks. I also like the idea of reducing the number of algorithms employed: if the uniform expansion is the main driver of the speed and accuracy for almost all arguments, that is more easily optimized than several different routines. The Miller algorithm seems to be a pretty common way to generate higher (integer) order functions when forward recurrence is not stable; that way you can use an optimized zero-order function to normalize for any order. This doesn't work as well here. Anyway, we still need the continued fraction approach for the intermediate zone.
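For reference, the normalization idea being described can be sketched in a few lines. This is my own toy version for I_n(x) (not the package's code), assuming the standard normalizing identity exp(x) = I_0(x) + 2*sum_{k>=1} I_k(x): recur downward from an arbitrary trial value at a padded order, then rescale the whole sequence at once.

```julia
# Toy Miller backward recurrence for I_n(x): start at a padded order M with an
# arbitrary tiny trial value, recur downward with
#   I_{k-1}(x) = I_{k+1}(x) + (2k/x) I_k(x),
# then normalize the unnormalized sequence via exp(x) = I_0 + 2*sum_{k>=1} I_k.
function besseli_miller(n::Integer, x::Float64; M::Integer = n + 30)
    fkp1, fk = 0.0, 1e-30        # trial values at orders M+1 and M
    fn = (M == n) ? fk : 0.0     # will hold the unnormalized I_n
    s = 0.0
    for k in M:-1:1
        fkm1 = fkp1 + (2k / x) * fk
        s += 2 * fk              # accumulate 2*f_k for k = M..1
        fkp1, fk = fk, fkm1
        k - 1 == n && (fn = fk)  # capture the unnormalized I_n on the way down
    end
    s += fk                      # add f_0; now s is the scale factor times exp(x)
    return fn * exp(x) / s
end

besseli_miller(3, 2.0)   # ≈ 0.21274, i.e. I_3(2)
```

One normalizing quantity (a zero-order evaluation, or the summation identity used here) then fixes every order at once, which is the appeal for integer-order ladders.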
Was thinking about this approach some more. The first challenge is that it requires a rotation of the argument onto the imaginary axis. The continued fraction piece looks like:

function besselk_ratio_knu_knup1(v, x::ComplexOrReal{T}) where T
MaxIter = 1000
S = eltype(x)
(hn, Dn, Cn) = (S(1e-50), zero(S), S(1e-50))
jf = one(T)
vv = v * v
@fastmath begin
for _ in 1:MaxIter
an = vv - (jf - 0.5)^2
bn = 2 * (x + jf)
Cn = an / Cn + bn
Dn = inv(muladd(an, Dn, bn))
del = Dn * Cn
hn *= del
checktol(del - 1) && break
jf += one(T)
end
xinv = (v + x + hn + 1/2) / x
end
return xinv
end
@inline checktol(err::T) where T = abs(err) < eps(Float64)
@inline checktol(err::Complex{T}) where T = abs2(err) < eps(T)^2

Perhaps it could be sped up further, but it is pretty fast already for the input range we need:

julia> @btime besselk_ratio_knu_knup1(v, x) setup=(x= 20.0im; v = rand()*1)
76.246 ns (0 allocations: 0 bytes)
1.0000915788150249 - 0.045999085366594im
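As a quick sanity check on the continued fraction (my addition, not from the thread; the ComplexOrReal alias is defined locally so the snippet is self-contained): half-integer orders have closed forms, e.g. K_{3/2}(x)/K_{1/2}(x) = 1 + 1/x and K_{5/2}(x)/K_{3/2}(x) = (1 + 3/x + 3/x^2)/(1 + 1/x), and the fraction terminates on them exactly because a_n hits zero.

```julia
# Self-contained copy of the ratio above (without @fastmath, and with the
# ComplexOrReal alias inlined), checked against half-integer closed forms.
const ComplexOrReal{T} = Union{T, Complex{T}}
@inline checktol(err::T) where T = abs(err) < eps(T)
@inline checktol(err::Complex{T}) where T = abs2(err) < eps(T)^2

function besselk_ratio_knu_knup1(v, x::ComplexOrReal{T}) where T
    MaxIter = 1000
    S = eltype(x)
    (hn, Dn, Cn) = (S(1e-50), zero(S), S(1e-50))   # modified Lentz initialization
    jf = one(T)
    vv = v * v
    for _ in 1:MaxIter
        an = vv - (jf - 0.5)^2
        bn = 2 * (x + jf)
        Cn = an / Cn + bn
        Dn = inv(muladd(an, Dn, bn))
        del = Dn * Cn
        hn *= del
        checktol(del - 1) && break
        jf += one(T)
    end
    return (v + x + hn + 1/2) / x     # K_{v+1}(x) / K_v(x)
end

besselk_ratio_knu_knup1(0.5, 2.0)   # ≈ 1.5 = 1 + 1/2
```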
julia> @btime besselk_ratio_knu_knup1(v, x) setup=(x= 7.0im; v = rand()*1)
141.996 ns (0 allocations: 0 bytes)
1.001602364037009 - 0.11366618431295909im

So a range between 76-142 nanoseconds in the regime we'd utilize this for, even with complex arithmetic. Of course we also need recurrence up to the higher order expansion (takes around 20 nanoseconds) and then one large order evaluation...

julia> @btime Bessels.besseljy_debye(y, x) setup=(x=10*rand()+12.0; y=rand()+60.0)
49.393 ns (0 allocations: 0 bytes)
(1.8133319226273097e-26, -3.0253831328694315e23)

Which takes around 50 nanoseconds. So that puts the total time for this approach around 150-225 nanoseconds in a region where no expansion gives good results. Now that might not seem that fast, but the existing approach using Chebyshev approximation is the longest approach we have...

julia> @btime Bessels.bessely_chebyshev(y, x) setup=(x=12*rand()+6; y=rand()*20.0)
460.577 ns (0 allocations: 0 bytes)

Which takes between 460-500 nanoseconds, so the new approach would be about 2-3x faster. I think it might be worth taking a look at the whole routine to some degree and trying to decrease the number of algorithms to improve branch prediction. There was some discussion on discourse about this, and it might be possible to improve the Chebyshev evaluation using SIMD; I'd assume it would be possible to improve those times by 2-3x.

Edit: There is a relation between these functions under rotation of the argument, so we could compute one from the other using that relation.

Edit 2: After some more consideration it might be best to compute the Hankel function directly and then just take the real and imaginary parts. Similar to what I showed above, we can construct the solution for the Hankel function with the continued fraction and the solution for the higher order term.

# computes Hankel function
# compute bessely from imaginary part
function test(nu, x)
fpart, _ = modf(nu)
nu_ceil = 45.0 + fpart
knu_ceil = complex(Bessels.besseljy_debye(nu_ceil, x)...)
knup1 = -im * conj(Bessels.besselk_ratio_knu_knup1(nu, x*im))
return knu_ceil / Bessels.besselj_up_recurrence(x, knup1, one(knup1), nu + 1, nu_ceil)[1]
end
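For reference, the connection formula implicitly used here (rotate the argument onto the imaginary axis, then undo the phase) is the standard identity DLMF 10.27.8:

```latex
K_\nu(z) \;=\; \tfrac{1}{2}\pi\, i^{\nu+1} H_\nu^{(1)}(iz),
\qquad -\pi < \arg z \le \tfrac{\pi}{2},
```

so H_ν^{(1)}(x) = J_ν(x) + iY_ν(x) can be recovered from K_ν at rotated argument up to a phase, which is why `test` evaluates the K-ratio at `x*im` and applies a conjugate and `-im` factor.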
julia> test(2.5, 18.0)
0.11922846888645108 + 0.14657028641952727im
julia> besselj(2.5, 18.0)
0.11922846888645013
julia> bessely(2.5, 18.0)
0.1465702864195262

Even with the code base as is, this is pretty fast.

julia> @benchmark test(2.5, x) setup=(x=6.0 + 10.0*rand())
BenchmarkTools.Trial: 10000 samples with 690 evaluations.
Range (min … max): 181.341 ns … 285.023 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 183.636 ns ┊ GC (median): 0.00%
Time (mean ± σ): 189.049 ns ± 10.150 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▄█ ▃▁▃ ▄▆▁▆ ▂ ▁ ▃▁▃ ▂
████▆▆█████▇▇▄▁▄▄▃▄▁▅▁▄████▇▆█▇█▇▆▆███▇▄▅▆▆▅▄▅▄▃▄▁▄▁▃▁▁█▇▄███ █
181 ns Histogram: log(frequency) by time 225 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

So that's much faster than the current approach. I think some more care will help the convergence by only doing the continued fraction for small orders and scaling from there. The question to me is the stability of the recurrence.
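On the stability question, a toy illustration (my example, not the package's `besselj_up_recurrence`) of why naive upward recurrence for J_ν cannot be trusted once the order exceeds the argument: seed with correctly rounded J_0(1) and J_1(1) and recur up; the rounding error carries a component of the dominant Y_ν solution, which grows roughly like (2/x)^ν (ν−1)!.

```julia
# Upward recurrence J_{k+1}(x) = (2k/x) J_k(x) - J_{k-1}(x), seeded with
# correctly rounded J_0(1) and J_1(1) (so the seeds hard-code x = 1).
# The true J_20(1) is ~4e-25, but the output is swamped by amplified error.
function besselj_up_naive(x, n)
    jm1, j = 0.7651976865579666, 0.4400505857449335   # J_0(1), J_1(1)
    for k in 1:n-1
        jm1, j = j, (2k / x) * j - jm1
    end
    return j
end

besselj_up_naive(1.0, 20)   # enormous compared to the true value ~4e-25
```

Recurring the Y-dominated combination upward (as the Hankel approach above does) avoids exactly this, since the dominant solution is the one being propagated.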
@cgeoga To continue this discussion a little bit, because I think it would be really nice to round out the modified bessel functions completely. I've also been thinking a lot about how to ensure the implementations are differentiable. Enzyme has come a long way on the main branch and can give pretty fast derivatives of the code in Bessels.jl already. Anyway, for computation of modified bessel functions (real arguments for now) there are really three main approaches, in order of precedence.
The advantages and disadvantages of those methods were extensively discussed above, but essentially method 1 converges faster for smaller arguments while method 2 converges faster for larger arguments. The slight disadvantage of method 2 is that its accuracy is tied to the uniform expansions, which are slightly less accurate (due to roundoff) than the power series. Approach 1 is also not feasible for complex numbers. I'm going to introduce a different approach that could be more robust, meaning that it will work for complex numbers and be accurate. This was inspired by https://arxiv.org/pdf/1112.0072.pdf when I was working on the airy functions: approach three is sequence acceleration of the divergent asymptotic series.
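The transformation behind this is easier to see in its explicit-sum form before any recursive scheme. A minimal sketch of my own (Levin's t-variant with β = 1 and remainder estimates ω_i = s_{i+1} − s_i, the same weights used later in this thread), applied to a simple alternating series rather than the Bessel series:

```julia
# Explicit-sum Levin t-transformation of a vector of partial sums s_1..s_m,
# at order k = m - 2, with remainder estimates ω_i = s_{i+1} - s_i.
function levin_t(s::AbstractVector{<:Real})
    k = length(s) - 2
    num, den = 0.0, 0.0
    for j in 0:k
        w = (-1)^j * binomial(k, j) * ((1.0 + j) / (1.0 + k))^(k - 1)
        ω = s[j + 2] - s[j + 1]          # remainder estimate
        num += w * s[j + 1] / ω
        den += w / ω
    end
    return num / den                     # accelerated estimate of the limit
end

# 12 partial sums of Σ (-1)^{n+1}/n = log(2): the raw sum is only good to
# ~4e-2, while the transformed value recovers many more digits.
s = cumsum([(-1)^(n + 1) / n for n in 1:12])
levin_t(s)
```

The recursive Levin-table schemes discussed below compute exactly these ratios without the explicit binomial sums, which is what makes the unrolled versions cheap.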
At first I ignored these approaches because they honestly just didn't seem worth it and not efficient at all. Essentially they are able to sum a divergent asymptotic series and extend the range of validity of the asymptotic expansions. Computationally they can be evaluated using recursive schemes, but they require significantly more computation: essentially you must build a Levin table and work down its counter-diagonals, which requires at a minimum N*(N+1)/2 operations.

This is probably why I initially avoided this approach: most of our routines are under 100 ns, and not only do you need all that arithmetic, you also need to store tables, so your cache lines and memory accesses are honestly rubbish. My first attempt at this using arrays took like 10 microseconds, which is awful. But I want to come back to the performance after discussing where this is accurate. So here are a few plots of the relative errors as a function of the argument.

The really cool thing here is that we are summing the asymptotic expansion for large argument well outside the region where we would normally use it. For the benchmarks it's hard to say, as the general form was so slow, but I completely unrolled a version (which is probably not what we want):

julia> p = ntuple(i -> i*2.0, 18)
(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0)
julia> @benchmark g1($p)
BenchmarkTools.Trial: 10000 samples with 990 evaluations.
Range (min … max): 45.328 ns … 84.385 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 45.455 ns ┊ GC (median): 0.00%
Time (mean ± σ): 45.635 ns ± 1.419 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▆▄▂ ▁ ▁
█████▄▁▁▁▁▁▁▃▁▁▁▃▁▄▁▃▃▁▃▅▃▁▁▁▁▃▃▁▄▁▁▁▁▃▇██▇▄▄▁▃▃▁▃▃▄▁▁▄▃▃▅▅ █
45.3 ns Histogram: log(frequency) by time 49.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

And the advantage is this isn't awful for complex numbers, because we have only real/complex multiplies and complex/complex additions.

julia> p = ntuple(i -> i*2.0 + rand()*im, 18)
(2.0 + 0.20671806918378222im, 4.0 + 0.5544791253581327im, 6.0 + 0.8743485225845774im, 8.0 + 0.5842694164427652im, 10.0 + 0.5376242211139655im, 12.0 + 0.24931450428480173im, 14.0 + 0.7889137184169184im, 16.0 + 0.18129161121223591im, 18.0 + 0.9320649223876125im, 20.0 + 0.4467818886051006im, 22.0 + 0.44671130589145225im, 24.0 + 0.09229756644375375im, 26.0 + 0.09649719037936122im, 28.0 + 0.7052070903007531im, 30.0 + 0.8544141179769782im, 32.0 + 0.35520748181052775im, 34.0 + 0.22437122283477695im, 36.0 + 0.9058299917597746im)
julia> @benchmark g1($p)
BenchmarkTools.Trial: 10000 samples with 832 evaluations.
Range (min … max): 147.886 ns … 208.534 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 148.689 ns ┊ GC (median): 0.00%
Time (mean ± σ): 149.701 ns ± 3.756 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▇█▆▂ ▁▃▄▂▁▁ ▁▁ ▂
█████▇▇█▇▇██████████▇▆▅▆▆▅▆▆▇▆▅▅▅▃▆▆▆▄▅▄▅▅▄▅▁▅▆▅▃▅▆▆▆▄▆▅▅▅▆▄▅ █
148 ns Histogram: log(frequency) by time 168 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
const CC = ((1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0),
(0.5, 0.6, 0.6666666666666666, 0.7142857142857143, 0.75, 0.7777777777777778, 0.8, 0.8181818181818182, 0.8333333333333334, 0.8461538461538461, 0.8571428571428571, 0.8666666666666667, 0.875, 0.8823529411764706, 0.8888888888888888),
(0.32, 0.4166666666666667, 0.4897959183673469, 0.546875, 0.5925925925925926, 0.63, 0.6611570247933884, 0.6875, 0.7100591715976331, 0.7295918367346939, 0.7466666666666667, 0.76171875, 0.7750865051903114, 0.7870370370370371),
(0.23148148148148148, 0.31486880466472306, 0.3828125, 0.438957475994513, 0.486, 0.5259203606311045, 0.5601851851851852, 0.589895311788803, 0.6158892128279884, 0.6388148148148148, 0.6591796875, 0.6773865255444739, 0.693758573388203),
(0.17992503123698458, 0.251220703125, 0.31214753848498705, 0.3645, 0.4098080732190424, 0.44931520061728397, 0.48401666608312033, 0.514707413577676, 0.5420246913580247, 0.5664825439453125, 0.5884987009255157, 0.6084152568206066),
(0.14654541015625, 0.20809835898999135, 0.26244, 0.3104606615295776, 0.35303337191358025, 0.3909365379902126, 0.4248378651752246, 0.45530074074074073, 0.48279762268066406, 0.5077243694259351, 0.5304133008179648),
(0.12331754606814303, 0.177147, 0.22578957202151098, 0.2696782702117627, 0.30931242566258577, 0.34518076545487, 0.3777309849108368, 0.4073604941368103, 0.4344165727708536, 0.4592004039488861),
(0.1062882, 0.15394743546921202, 0.1977640648219593, 0.237932635125066, 0.27473571128040675, 0.30848030434385004, 0.3394670784473419, 0.3679763910529583, 0.39426297308742747),
(0.09330147604194668, 0.13596279456509702, 0.17570409978466411, 0.21259310991936234, 0.24678424347508002, 0.27846908778883517, 0.30784952976979524, 0.33512352712431337),
(0.08308837445644818, 0.12164129985092131, 0.15792631022581202, 0.19194330048061778, 0.22376980268745683, 0.25352314216336075, 0.28133826968460873),
(0.07485618452364388, 0.10998439462154766, 0.14331766435886129, 0.17482015834957565, 0.20452287098892968, 0.2324948200865864),
(0.06808557762286284, 0.1003223650512029, 0.13111511876218174, 0.1604100948932782, 0.18821009245104614),
(0.06242280492074847, 0.09219031787965903, 0.12077936556670357, 0.14812831350313815),
(0.057618948674786896, 0.08525602275296724, 0.11191917020237105),
(0.053493975060685324, 0.07927607889334616),
(0.049914568192106844,))
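If I've matched the pattern correctly, the CC table is not a set of magic constants: entry CC[k][j] appears to be the Levin weight (j+1)(j+k)^{k-2}/(j+k+1)^{k-1} with β = 1, i.e. the negative of the `levin_scale` factor that shows up later in this thread. A hypothetical generator (my observation, not the code that produced the table):

```julia
# Hypothetical generator for the CC entries above: CC[k][j] == levin_weight(j, k).
# E.g. levin_weight(1, 3) = 2 * 4 / 25 = 0.32, matching CC[3][1].
levin_weight(j, k) = (j + 1) * (j + k)^(k - 2.0) / (j + k + 1.0)^(k - 1)

levin_weight(1, 3)   # ≈ 0.32
```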
function g1(s)
b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15, b16, b17 = 1 / (s[2] - s[1]), 1 / (s[3] - s[2]), 1 / (s[4] - s[3]), 1 / (s[5] - s[4]), 1 / (s[6] - s[5]), 1 / (s[7] - s[6]), 1 / (s[8] - s[7]), 1 / (s[9] - s[8]), 1 / (s[10] - s[9]), 1 / (s[11] - s[10]), 1 / (s[12] - s[11]), 1 / (s[13] - s[12]), 1 / (s[14] - s[13]), 1 / (s[15] - s[14]), 1 / (s[16] - s[15]), 1 / (s[17] - s[16]), 1 / (s[18] - s[17])
a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17 = s[1] * b1, s[2] * b2, s[3] * b3, s[4] * b4, s[5] * b5, s[6] * b6, s[7] * b7, s[8] * b8, s[9] * b9, s[10] * b10, s[11] * b11, s[12] * b12, s[13] * b13, s[14] * b14, s[15] * b15, s[16] * b16, s[17] * b17
index = 1
a1 = muladd(a1, -CC[index][1], a2)
b1 = muladd(b1, -CC[index][1], b2)
a2 = muladd(a2, -CC[index][2], a3)
b2 = muladd(b2, -CC[index][2], b3)
a3 = muladd(a3, -CC[index][3], a4)
b3 = muladd(b3, -CC[index][3], b4)
a4 = muladd(a4, -CC[index][4], a5)
b4 = muladd(b4, -CC[index][4], b5)
a5 = muladd(a5, -CC[index][5], a6)
b5 = muladd(b5, -CC[index][5], b6)
a6 = muladd(a6, -CC[index][6], a7)
b6 = muladd(b6, -CC[index][6], b7)
a7 = muladd(a7, -CC[index][7], a8)
b7 = muladd(b7, -CC[index][7], b8)
a8 = muladd(a8, -CC[index][8], a9)
b8 = muladd(b8, -CC[index][8], b9)
a9 = muladd(a9, -CC[index][9], a10)
b9 = muladd(b9, -CC[index][9], b10)
a10 = muladd(a10, -CC[index][10], a11)
b10 = muladd(b10, -CC[index][10], b11)
a11 = muladd(a11, -CC[index][11], a12)
b11 = muladd(b11, -CC[index][11], b12)
a12 = muladd(a12, -CC[index][12], a13)
b12 = muladd(b12, -CC[index][12], b13)
a13 = muladd(a13, -CC[index][13], a14)
b13 = muladd(b13, -CC[index][13], b14)
a14 = muladd(a14, -CC[index][14], a15)
b14 = muladd(b14, -CC[index][14], b15)
a15 = muladd(a15, -CC[index][15], a16)
b15 = muladd(b15, -CC[index][15], b16)
a16 = muladd(a16, -CC[index][16], a17)
b16 = muladd(b16, -CC[index][16], b17)
index = 2
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
a8 = a9 - a8*CC[index][8]
b8 = b9 - b8*CC[index][8]
a9 = a10 - a9*CC[index][9]
b9 = b10 - b9*CC[index][9]
a10 = a11 - a10*CC[index][10]
b10 = b11 - b10*CC[index][10]
a11 = a12 - a11*CC[index][11]
b11 = b12 - b11*CC[index][11]
a12 = a13 - a12*CC[index][12]
b12 = b13 - b12*CC[index][12]
a13 = a14 - a13*CC[index][13]
b13 = b14 - b13*CC[index][13]
a14 = a15 - a14*CC[index][14]
b14 = b15 - b14*CC[index][14]
a15 = a16 - a15*CC[index][15]
b15 = b16 - b15*CC[index][15]
index = 3
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
a8 = a9 - a8*CC[index][8]
b8 = b9 - b8*CC[index][8]
a9 = a10 - a9*CC[index][9]
b9 = b10 - b9*CC[index][9]
a10 = a11 - a10*CC[index][10]
b10 = b11 - b10*CC[index][10]
a11 = a12 - a11*CC[index][11]
b11 = b12 - b11*CC[index][11]
a12 = a13 - a12*CC[index][12]
b12 = b13 - b12*CC[index][12]
a13 = a14 - a13*CC[index][13]
b13 = b14 - b13*CC[index][13]
a14 = a15 - a14*CC[index][14]
b14 = b15 - b14*CC[index][14]
index = 4
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
a8 = a9 - a8*CC[index][8]
b8 = b9 - b8*CC[index][8]
a9 = a10 - a9*CC[index][9]
b9 = b10 - b9*CC[index][9]
a10 = a11 - a10*CC[index][10]
b10 = b11 - b10*CC[index][10]
a11 = a12 - a11*CC[index][11]
b11 = b12 - b11*CC[index][11]
a12 = a13 - a12*CC[index][12]
b12 = b13 - b12*CC[index][12]
a13 = a14 - a13*CC[index][13]
b13 = b14 - b13*CC[index][13]
index = 5
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
a8 = a9 - a8*CC[index][8]
b8 = b9 - b8*CC[index][8]
a9 = a10 - a9*CC[index][9]
b9 = b10 - b9*CC[index][9]
a10 = a11 - a10*CC[index][10]
b10 = b11 - b10*CC[index][10]
a11 = a12 - a11*CC[index][11]
b11 = b12 - b11*CC[index][11]
a12 = a13 - a12*CC[index][12]
b12 = b13 - b12*CC[index][12]
index = 6
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
a8 = a9 - a8*CC[index][8]
b8 = b9 - b8*CC[index][8]
a9 = a10 - a9*CC[index][9]
b9 = b10 - b9*CC[index][9]
a10 = a11 - a10*CC[index][10]
b10 = b11 - b10*CC[index][10]
a11 = a12 - a11*CC[index][11]
b11 = b12 - b11*CC[index][11]
index = 7
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
a8 = a9 - a8*CC[index][8]
b8 = b9 - b8*CC[index][8]
a9 = a10 - a9*CC[index][9]
b9 = b10 - b9*CC[index][9]
a10 = a11 - a10*CC[index][10]
b10 = b11 - b10*CC[index][10]
index = 8
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
a8 = a9 - a8*CC[index][8]
b8 = b9 - b8*CC[index][8]
a9 = a10 - a9*CC[index][9]
b9 = b10 - b9*CC[index][9]
index = 9
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
a8 = a9 - a8*CC[index][8]
b8 = b9 - b8*CC[index][8]
index = 10
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
a7 = a8 - a7*CC[index][7]
b7 = b8 - b7*CC[index][7]
index = 11
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
a6 = a7 - a6*CC[index][6]
b6 = b7 - b6*CC[index][6]
index = 12
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
a5 = a6 - a5*CC[index][5]
b5 = b6 - b5*CC[index][5]
index = 13
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
a4 = a5 - a4*CC[index][4]
b4 = b5 - b4*CC[index][4]
index = 14
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
a3 = a4 - a3*CC[index][3]
b3 = b4 - b3*CC[index][3]
index = 15
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
a2 = a3 - a2*CC[index][2]
b2 = b3 - b2*CC[index][2]
index = 16
a1 = a2 - a1*CC[index][1]
b1 = b2 - b1*CC[index][1]
return a1 / b1
end
Isn't this just 16 pairs of evalpolys of degrees 1 to 16?
This is insanity. But I'm into it. Would you mind sharing your un-optimized version? Also, you mention differentiability with Enzyme---that's pretty cool! Would you also mind sharing those snippets/tests? I just tried in a fresh environment pulling the most recent versions and didn't have much luck for any routines that weren't the fixed order variety.
Yes - but I think the difference here is that you have to store the intermediate values.
Haha ya, me too 😄 The code was originally based on https://github.com/MikaelSlevinsky/SequenceTransformations.jl, and once you peel away all the generated functions and what have you, you get code that looks like...

function g(s)
n = length(s) - 1
l = similar(s)
N = similar(l)
D = similar(N)
for k in 1:n
w = s[k+1] - s[k]
N[k], D[k] = s[k] / w, 1.0 / w
kp1 = k + 1.0
kdkp1 = k/kp1
kdkp1jm2 = 1/kdkp1
for j in 1:k-1
cst = -(kp1-j)*kdkp1jm2/kp1
N[k-j] = muladd(cst,N[k-j],N[k-j+1])
D[k-j] = muladd(cst,D[k-j],D[k-j+1])
kdkp1jm2 *= kdkp1
end
l[k] = N[1]/D[1]
end
return l
end

Of course I decided to just completely unroll everything and precompute the constants.
Ya, I filed a few issues a couple of weeks ago and Billy fixed them on main. They are working on a new rule system, I believe, so it's been a while since they have released a tagged version; you'll have to try Enzyme#main.

julia> @benchmark besselj(v, 1.2) setup=(v=rand())
BenchmarkTools.Trial: 10000 samples with 981 evaluations.
Range (min … max): 62.096 ns … 102.404 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 67.406 ns ┊ GC (median): 0.00%
Time (mean ± σ): 65.869 ns ± 2.674 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄▆▇▅ ▅█▆▂ ▁▂ ▂
▅████████▄▄▃▁▁▁▄▄▄▃▃▁▄▁▅███▆▆▃▁▁▁▃▃▅█████▅▄▃▁▄▁▁▁▄▄▃▄▁▃▁▇███ █
62.1 ns Histogram: log(frequency) by time 70.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark autodiff(Forward, v -> besselj(v, 1.2), Duplicated, Duplicated(2.4, 1.0))
BenchmarkTools.Trial: 10000 samples with 952 evaluations.
Range (min … max): 94.625 ns … 137.693 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 94.889 ns ┊ GC (median): 0.00%
Time (mean ± σ): 95.189 ns ± 1.758 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇█▇▄▁ ▂▃▂ ▂
▇██████▅▁▁▃▄▁▄▁▅▁▁▁▁▄▄▁▁▁▃▃▄▁▄▃▃▁▁▅▇████▄▄▁▃▃▃▃▁▁▄▃▁▄▄▅▆▆▆▆▆ █
94.6 ns Histogram: log(frequency) by time 99.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Here is the code working.
Yeah, my suggestion was the hybrid of this looped version with your unrolled version (where you precompute the constants).
Ya - I think my initial attempt at this removed that computation:

function g(s)
n = length(s) - 1
l = similar(s)
N = similar(l)
D = similar(N)
for k in 1:n
w = s[k+1] - s[k]
N[k], D[k] = s[k] / w, 1.0 / w
for j in 1:k-1
N[k-j] = muladd(-2.1,N[k-j],N[k-j+1])
D[k-j] = muladd(-2.1,D[k-j],D[k-j+1])
end
l[k] = N[1]/D[1]
end
return l
end

julia> p = rand(16);
julia> @benchmark g($p)
BenchmarkTools.Trial: 10000 samples with 200 evaluations.
Range (min … max): 406.665 ns … 2.451 μs ┊ GC (min … max): 0.00% … 82.54%
Time (median): 416.250 ns ┊ GC (median): 0.00%
Time (mean ± σ): 423.480 ns ± 92.645 ns ┊ GC (mean ± σ): 1.09% ± 4.06%
▁▃▅▆████▇▆▅▄▃▂▁▁▁▂▂▂▂▂ ▁▁▁▁ ▂
▅▆█████████████████████████▇█▇▇███████▇▇▇▆▅▆▆▆▆▅▆▆▅▂▄▄▅▃▄▄▆▅ █
407 ns Histogram: log(frequency) by time 466 ns <
Memory estimate: 576 bytes, allocs estimate: 3.
julia> p = rand(20);
julia> @benchmark g($p)
BenchmarkTools.Trial: 10000 samples with 171 evaluations.
Range (min … max): 632.795 ns … 2.940 μs ┊ GC (min … max): 0.00% … 77.83%
Time (median): 647.421 ns ┊ GC (median): 0.00%
Time (mean ± σ): 656.425 ns ± 106.955 ns ┊ GC (mean ± σ): 0.79% ± 3.76%
▂▂▄▆▇█▇▇▆▄▃▁▁▁▁▁▃▃▃▂▁▁ ▁▁▁▂ ▂
▂▄▅▆▇███████████████████████████████▇▇▇▇▅▆▆▆▆▆▅▆▅▅▄▅▄▄▂▅▄▃▅▄▄ █
633 ns Histogram: log(frequency) by time 713 ns <
Memory estimate: 672 bytes, allocs estimate: 3.
julia> p = rand(24);
julia> @benchmark g($p)
BenchmarkTools.Trial: 10000 samples with 45 evaluations.
Range (min … max): 908.333 ns … 10.032 μs ┊ GC (min … max): 0.00% … 90.52%
Time (median): 934.244 ns ┊ GC (median): 0.00%
Time (mean ± σ): 945.136 ns ± 218.800 ns ┊ GC (mean ± σ): 0.56% ± 2.21%
▁▄▄▅▆▇███▇▆▄▂▁ ▁▃▃▄▃▃▃▁▁ ▂
▄▃▄▄▆▇██████████████▇▅▆▇██████████████▇█▇▇█▇█▇▇█▆▅▅▄▃▄▅▅▅▄▆▅▆ █
908 ns Histogram: log(frequency) by time 1.02 μs <
Memory estimate: 768 bytes, allocs estimate: 3.

So the unrolled tuple version is over 100x faster (of course this is much more memory friendly).
Would that vectorize better than a fully unrolled one?
Both of the versions vectorize in the sense that the numerator and denominator are computed simultaneously. There is an opportunity to further vectorize if we want to store two different arrays with offset indexes that allow computation of the whole vector, but if we fill the SIMD registers with just a numerator (say we reduce the total computations by 4x) we still have to compute that twice, for denominator and numerator, and that is not going to vectorize well. I don't think it becomes worth it to go much further than SIMDing the numerator and denominator together (which is the natural choice), because no matter what we have to compute a 16->15->14 multiply chain, so if we unroll by a factor of 4 we get a 4->4->4->4->3... multiply structure, with excess multiplies in the registers that won't quite fill them all. So if the naive case is n*(n+1)/2 multiplies (136 for n=16), the perfect SIMD version would have (4*4 + 4*3 + ...) = 40 multiplies. The naive 136-multiply version still vectorizes the numerator and denominator together, whereas if we fully SIMD a single evaluation we have 40*2 = 80 multiplies in the best case, compared to 136. That's a lot, but it starts to become not worth it: we have to store additional vectors and have them packed efficiently, so I doubt it will help much when all is said and done. I initially tried it, but then asked: what happens when we move to the complex domain? It will make more sense to keep the extra SIMD registers to compute those complex additions better.
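Spelling out the counts quoted above (my arithmetic, for concreteness):

```latex
\underbrace{16 + 15 + \cdots + 1}_{\text{triangular update chain}}
  \;=\; \left.\frac{n(n+1)}{2}\right|_{n=16} \;=\; 136,
\qquad
\underbrace{4+4+4+4+3+\cdots}_{\text{perfectly packed 4-wide}} \;\approx\; 40
```

so a perfectly packed single sequence costs roughly 40 vector multiplies, or 80 when the numerator and denominator can no longer share lanes, versus 136 scalar-chain multiplies when they do.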
I think you are right here. My initial attempt was not type stable, as the tuple length changed in each iteration, so we need to be aware of that.
Wow, well honestly a lot of that is pretty over my head here. I have never really taken SIMD seriously. I'm trying to write the generated version along these lines:

function g1_gen(s)
# declare the a and b scalar variables:
Base.Cartesian.@nexprs 17 j->(b_j = inv(s[j+1] - s[j]))
Base.Cartesian.@nexprs 17 j->(a_j = b_j*s[j])
# first sequence of operations for index=1:
CC1 = CC[1]
Base.Cartesian.@nexprs 16 j->begin
a_j = muladd(a_j, -CC1[j], a_{j+1})
b_j = muladd(b_j, -CC1[j], b_{j+1})
end
# outer loop over "index".
# STUMPER: Is there some way to make sure that this gets
# expanded before the "inner" @nexprs, so that {16-index_m1} expands to the
# raw int before the inner call to @nexprs?
Base.Cartesian.@nexprs 15 index_m1->begin
index = index_m1+1
# inner loop over the variable index:
Base.Cartesian.@nexprs {16-index_m1} j->begin
a_j = a_{j+1} - a_j*CC[index][j]
b_j = b_{j+1} - b_j*CC[index][j]
end
end
a_1/b_1
end But if there is some trick to make that outer |
Thanks for the idea 😄 I think this can go into an example of the what-the-heck / amazing-examples-of-Julia category, but:

function g1(s)
@nexprs 17 i -> b_{i} = 1 / (s[i + 1] - s[i])
@nexprs 17 i -> a_{i} = s[i] * b_{i}
@nexprs 16 index -> (@nexprs 17-index i -> (a_{i}, b_{i}) = (muladd(a_{i}, -CC[index][i], a_{i+1}), muladd(b_{i}, -CC[index][i], b_{i+1})))
return a_1 / b_1
end

I believe that is correct (at least it returns the correct answer) and is what you were looking for, in like three lines of code!

julia> @benchmark g1($(0.11274150564997776, 0.11475646021904119, 0.11370868384312821, 0.11451023277070166, 0.11369906525599734, 0.11472113632452476, 0.11317926911257481, 0.11588910073757686, 0.11045137194340607, 0.12271888810305538, 0.09198318397942494, 0.17666004884002684, -0.07776136204113551, 0.7501986007978472, -2.150973108989948, 8.739300196625988, -34.86022928456346, 150.58310277542904))
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
Range (min … max): 41.835 ns … 73.957 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 41.961 ns ┊ GC (median): 0.00%
Time (mean ± σ): 42.132 ns ± 1.291 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▆▃▂ ▁
█████▅▁▁▁▁▁▁▁▄▁▁▁▃▃▁▃▃▄▄▄▄▃▅▄▁▁▁▄▃▁▁▄▁▅███▅▅▇▄▁▁▁▁▃▄▁▅▃▁▃▃▄ █
41.8 ns Histogram: log(frequency) by time 46 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
I think we could use a generated function as well, based on the NTuple length. @oscardssmith I'm curious how you feel about including code like that in the package, considering compile times / package load times / cache pressure? The generated LLVM code is immense; how does that affect other code that may call it?
In terms of speed, the generated function + @nexprs approach will be really good, and I don't think the cache pressure/load times will be much worse than the vectorized evalpolys (which also unroll). The one thing you could do that would help is generating the n^2 @nexprs operations as vectorized muladds in the first place (since doing so will reduce the number of instructions by 4x/8x).
I can't believe it. I just had to stop trying to be clever with the index arithmetic.
Haha ya, one thing about these types of expressions is they aren't forgiving :) But ya, this is really elegant; I like it a lot and it's very fast. I would think it would be good to move forward with this type of approach. I will work on Oscar's suggestion a little bit to reduce the number of instructions, but here is a fully general version of Levin's sequence transformation (which works for any N). And because we are unrolling it like this, we can put the constant factor within the function at no cost, which helps it be fully general.

@inline levin_scale(B, n, k) = -(B + n) * (B + n + k)^(k - 1.0) / (B + n + k + 1.0)^k
@generated function levin_transform(s::NTuple{N, Float64}) where N
len = N-1
len2 = N-2
:(
begin
@nexprs $len i -> b_{i} = 1 / (s[i + 1] - s[i])
@nexprs $len i -> a_{i} = s[i] * b_{i}
@nexprs $len2 k -> (@nexprs ($len-k) i -> (a_{i}, b_{i}) = (muladd(a_{i}, levin_scale(1.0, i, k-1), a_{i+1}), muladd(b_{i}, levin_scale(1.0, i, k-1), b_{i+1})))
return a_1 / b_1
end
)
end One thing to note is we aren't quite done yet is that this is a sequence transform so we need to generate efficiently the sequence ( |
Ok, here is an actual version that we can test and improve for computing the scaled modified Bessel function. This version works with SIMDMath.jl now to explicitly vectorize all these calculations. O, and I have a version that works for complex numbers!

using Base.Cartesian
using SIMDMath: VE, muladd, LVec, FloatTypes
@inline levin_scale(B::T, n, k) where T = -(B + n) * (B + n + k)^(k - one(T)) / (B + n + k + one(T))^k
# compute the scaled modified bessel function using the Levin-u transformation
# v = 0.2; x = 4.3
# besselkx(v, x, Val(19))
@generated function besselkx_levin(v, x::T, ::Val{N}) where {T <: FloatTypes, N}
len = N-1
len2 = N-2
:(
begin
# generate a sequence of partial sums and weights of the asymptotic expansion for large argument besselk
l = let (out, term, fv2) = (one(typeof(x)), one(typeof(x)), 4*v^2)
@ntuple $N i -> begin
offset = muladd(2, i, -1)
term *= muladd(offset, -offset, fv2) / (8 * x * i)
out += term
# this check is relatively cheap but if the normal sum converges then we should return value
abs(term) <= eps(T) && return out * sqrt(π / 2x)
# need to return the weights and weighted partial sums
invterm = inv(term)
LVec{2, T}((out * invterm, invterm))
end
end
# this is required because you can't set the index of a tuple
@nexprs $len i -> a_{i} = l[i]
@nexprs $len2 k -> (@nexprs ($len-k) i -> a_{i} = muladd(a_{i}, levin_scale(one(T), i, k-1), a_{i+1}))
return (a_1[1].value / a_1[2].value) * sqrt(π / 2x)
end
)
end
@generated function besselkx_levin(v, x::Complex{T}, ::Val{N}) where {T <: FloatTypes, N}
len = N-1
len2 = N-2
:(
begin
# generate a sequence of partial sums and weights of the asymptotic expansion for large argument besselk
l = let (out, term, fv2) = (one(typeof(x)), one(typeof(x)), 4*v^2)
@ntuple $N i -> begin
offset = muladd(2, i, -1)
term *= @fastmath muladd(offset, -offset, fv2) / (8 * x * i)
out += term
# this check is relatively cheap but if the normal sum converges then we should return value
(abs(real(term)) <= eps(T) && abs(imag(term)) <= eps(T)) && return out * sqrt(π / 2x)
# need to return the weights and weighted partial sums
invterm = @fastmath inv(term)
LVec{4, T}((reim(out * invterm)..., reim(invterm)...))
end
end
# this is required because you can't set the index of a tuple
@nexprs $len i -> a_{i} = l[i]
@nexprs $len2 k -> (@nexprs ($len-k) i -> a_{i} = muladd(a_{i}, levin_scale(one(T), i, k-1), a_{i+1}))
return (complex(a_1[1].value, a_1[2].value) / complex(a_1[3].value, a_1[4].value)) * sqrt(π / 2x)
end
)
end
besselk_levin(v, x::T, ::Val{N}) where {T <: Union{Complex{FloatTypes}, FloatTypes}, N} = exp(-x) * besselkx_levin(v, x, Val(N))
I've added some comments in there about what it is actually doing (though still somewhat terse), but I think this is ready to be workshopped a little bit. The real version will generate `<2 x double>` instructions, where the complex version will generate `<4 x double>` instructions. The main challenge with the complex code is that the runtime is dominated by the divisions in the sequence generation, not in the transformation. Some examples and benchmarks:

```julia
julia> v = 0.2
0.2

julia> x = 4.3
4.3

julia> besselkx_levin(v, x, Val(16))
0.5911957440242439

julia> using SpecialFunctions

julia> besselkx(v, x)
0.591195744024244

julia> using BenchmarkTools

julia> @benchmark besselkx_levin($v, $x, Val(16))
BenchmarkTools.Trial: 10000 samples with 987 evaluations.
 Range (min … max):  50.411 ns … 90.210 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.954 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.462 ns ±  4.162 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅ ▁ ▁
  ██▁██▄▄▁▄▁▁▁▁▁▁▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▃█ █
  50.4 ns      Histogram: log(frequency) by time        86.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
And now for complex numbers!!!

```julia
julia> x = 4.3 + 1.1im
4.3 + 1.1im

julia> besselkx_levin(v, x, Val(16))
0.5783561826343409 - 0.06993526928128495im

julia> besselkx(v, x)
0.578356182634341 - 0.0699352692812849im
```

So you might think that the complex version would be much, much slower considering all the divisions and stuff, but because we are able to vectorize it pretty well it is only about 3x slower than the real version (I'm sure I can play around with the sequence generation to get a bit faster).

```julia
julia> @benchmark besselkx_levin($v, $x, Val(16))
BenchmarkTools.Trial: 10000 samples with 803 evaluations.
 Range (min … max):  155.110 ns … 180.239 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     155.340 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   155.431 ns ±   0.667 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▆███▆▄▁          ▁▁                                        ▂
  ▄█████████▃▁▃▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▇█████▇ █
  155 ns       Histogram: log(frequency) by time         158 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Now that I'm looking at the code more, we can just combine the real and complex versions into one function with a branch on the type information, which will be compiled away for each version.
Okay, not sure if this is the right place for it, but I've been playing with derivatives of this. The tricky part is the early return when the sum converges:

```julia
[...]
abs(term) <= eps(T) && return out * sqrt(π / 2x)
[...]
```

But if you just comment that out, then the converged weights become zeros and the transform starts producing NaNs. My next naive thought is to just stop updating the things stored in the tuple when the series has converged:

```julia
[...]
if abs(term) > eps(T)
    out += term
    invterm = inv(term)
end
[...]
```

And just by putting that little branch there, the tuple values will stay constant after convergence. BUT, the issue that still comes up is with the actual Levin transform stuff. And this is the part that I don't understand super well, but each of those steps starts at the first index, so even though the series is converged, you still get different numbers if you compare different truncation lengths. I don't think this is impossible, but just an update, because it will probably be relevant in other places too. I think we just need to find a way to let terms be zero in peace. I'll keep tinkering with this for now, though, unless you think your rewrite of the Levin code will fix this for free.
Hmmmm... I actually don't think this should be an issue with the Levin transform in itself, but with the generation of the partial sums. For instance, look at the expansion in NIST 10.40.2:

```julia
function _besselk_large_argument(v, x::T) where T
    MaxIter = 50
    S = eltype(x)
    v, x = S(v), S(x)
    fv2 = 4 * v^2
    term = one(S)
    res = term
    s = term
    for i in 1:MaxIter
        @show res
        offset = muladd(2, i, -1)
        term *= muladd(offset, -offset, fv2) / (8 * x * i)
        res = muladd(term, s, res)
        #abs(term) <= eps(T) && break
    end
    return res
end
```

At `v = 1/2` these partial sums just generate ones, which, when we plug in the weight function, becomes zeros for the weights. Now zeros for the weights will just generate NaNs really quickly. But what we need to have happen is to differentiate to the form given in 10.40.8 (which I showed in the other issue) before calculating the weights... So to me, if ForwardDiff can work with the regular asymptotic expansion, it should also work here. The partial sums we generate are just not right. This seems to be an issue with how I've set the code up and not with the actual transform itself? Will look more at this.
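To see the degenerate sequence concretely, here is a tiny standalone loop (the same series written without the `muladd`s; illustration only, not library code) showing that at `v = 1/2` every term past the zeroth vanishes, so the partial sums are identically 1 and the Levin weights `1/(s[i+1] - s[i])` would divide by zero:

```julia
# partial sums of the large-argument asymptotic series for scaled besselk
# (NIST 10.40.2), without the sqrt(pi/2x) prefactor
function asymptotic_partial_sums(v, x, n)
    term = 1.0
    out  = 1.0
    sums = Float64[]
    for i in 1:n
        term *= (4v^2 - (2i - 1)^2) / (8x * i)   # identically zero for all i when v = 1/2
        out  += term
        push!(sums, out)
    end
    return sums
end

asymptotic_partial_sums(0.5, 4.3, 8)   # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
asymptotic_partial_sums(0.2, 4.3, 8)   # genuinely varying partial sums
```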
For sure, I'm with you about that. I just thought there was something about the Levin transform that required keeping that second component. EDIT: okay, wrote that too quickly. We do need that for the Levin transform one way or another, and I'm pretty sure early termination will not give correct derivatives. This was the reason we started using the exponential improvement for the half-integer case.
Could you also post code that produces wrong results? Like, does it produce zeros or the wrong results, and is this just for `v = 1/2`? I just tried with Enzyme and it seemed reasonable?

```julia
julia> autodiff(Forward, v -> besselkx_levin(v, 12.0, Val(16)), Duplicated, Duplicated(0.5, 1.0))
(0.3618006272791338, 0.015075026136630575)

julia> besselkx(0.5, 12.0)
0.3618006272791339

julia> autodiff(Forward, v -> besselkx_levin(v, 12.0, Val(16)), Duplicated, Duplicated(1.5, 1.0))
(0.39195067955239493, 0.04710945667697055)

julia> besselkx(1.5, 12.0)
0.3919506795523951
```
I'll write more on this tomorrow, but unfortunately the answers that seem reasonable are often wrong for this edge case. Consider:

```julia
using FiniteDifferences, Bessels

const FDM = central_fdm(10, 1)

function ref_dbesselkx_dv(v, x)
    f2 = Base.Fix2(Bessels.besselkx, x)
    FDM(f2, v)
end

ref_dbesselkx_dv(0.5, 12.0) # ~ 0.0145
ref_dbesselkx_dv(1.5, 12.0) # ~ 0.0470
```

So what you're seeing is not catastrophically inaccurate, but also only correct to a couple significant digits. Here is the version of your code that I am using to play with:

```julia
using ForwardDiff

@inline function levin_scale(B::T, n, k) where T
    -(B + n) * (B + n + k)^(k - one(T)) / (B + n + k + one(T))^k
end

@generated function _besselkx_levin(v::V, x::T, ::Val{N}) where {V, T, N}
    len = N - 1
    len2 = N - 2
    outT = promote_type(V, T)
    quote
        begin
            l = let (out, term, fv2, eightx) = (one($outT), one($outT), 4*v^2, 8*x)
                Base.Cartesian.@ntuple $N i -> begin
                    offset = muladd(2, i, -1)
                    term *= muladd(offset, -offset, fv2) / (eightx*i)
                    out += term
                    abs(term) <= 1e-15 && return out * sqrt(π / 2x)
                    invterm = inv(term)
                    (out*invterm, invterm)
                end
            end
            Base.Cartesian.@nexprs $len i -> a_{i} = l[i]
            Base.Cartesian.@nexprs $len2 k -> begin
                Base.Cartesian.@nexprs ($len-k) i -> begin
                    sc = levin_scale(one(T), i, k-1)
                    a_{i} = (muladd(a_{i}[1], sc, a_{i+1}[1]),
                             muladd(a_{i}[2], sc, a_{i+1}[2]))
                end
            end
            return $outT((a_1[1]/a_1[2]) * sqrt(π / 2x))
        end
    end
end

function our_dbesselkx_dv(v, x)
    f2 = _v -> _besselkx_levin(_v, x, Val(16))
    ForwardDiff.derivative(f2, v)
end
```

And again you see:

```julia
julia> ref_vs_ours(v, x) = (ref_dbesselkx_dv(v, x) - our_dbesselkx_dv(v, x))/abs(ref_dbesselkx_dv(v, x))
ref_vs_ours (generic function with 1 method)

julia> ref_vs_ours(1.5, 12.0)
-0.0020240634451386256

julia> ref_vs_ours(1.501, 12.0)
-2.9332885207854924e-13
```

The reason for this is the early return when the series terminates: for half-integer orders you effectively compute $\mathcal{K}_{\nu}(x)=\sum_{j=1}^{\lceil\nu\rceil}[\ldots]+R_{\nu}(x)$ and drop the remainder, which is fine for the value but not for the $\nu$-derivative. Anyways, this comment is getting long, sorry. I do think this can be addressed, but it will probably in the end involve finding a way to accumulate coefficients even when the not-derivative part is 0, if you know what I mean.
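The value-converges-but-derivative-doesn't effect is easy to reproduce in miniature. In this toy series (made up for illustration; it has nothing to do with the actual besselk terms beyond the `4v^2 - 1` factor), every term past the first vanishes at `v = 1/2`, so stopping the sum once the value converges reports a derivative of zero even though the true derivative of the dropped terms is not zero:

```julia
# toy series f(v) = sum_i (4v^2 - 1)^i / i!  and its exact v-derivative
f(v, n)  = sum(i -> (4v^2 - 1)^i / factorial(i), 0:n)
df(v, n) = sum(i -> i * (4v^2 - 1)^(i - 1) * 8v / factorial(i), 1:n)

f(0.5, 10)    # every term with i >= 1 is zero, so the value is exactly 1.0
df(0.5, 10)   # but the i = 1 term of the derivative survives: 4.0
```

An early return triggered by `term == 0` at `i = 1` would compute the value correctly but report a derivative contribution of `0.0` instead of `4.0`, which is exactly the half-integer failure mode above.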
So you are saying we should just use finite differences 😄 This is an interesting problem (and at least we seem to be tackling the toughest case first). I just want to take one step back to make sure I see the problem. If we look at the regular asymptotic expansion:

```julia
function besselk_large_argument(v, x::T) where T
    a = exp(-x / 2)
    coef = a * sqrt(pi / 2x)
    return T(_besselk_large_argument(v, x) * coef * a)
end

function _besselk_large_argument(v, x::T) where T
    MaxIter = 50
    S = eltype(x)
    v, x = S(v), S(x)
    fv2 = 4 * v^2
    term = one(S)
    res = term
    s = term
    for i in 1:MaxIter
        offset = muladd(2, i, -1)
        term *= muladd(offset, -offset, fv2) / (8 * x * i)
        res = muladd(term, s, res)
        #abs(term) <= eps(T) && break
    end
    return res
end
```

And I modify the code to just run for 50 terms no matter what, we can get good derivatives? At least that seems to match ok:

```julia
julia> autodiff(Forward, v -> besselk_large_argument(v, 55.0), Duplicated, Duplicated(0.5, 1.0))[2]
1.97876284377238e-27

julia> dvbesselk(ArbFloat(0.5), ArbFloat(55.0))
1.97876284377238030106336346605515496107532396354231328952625258552628280879127918412731333961880277672326758481184840407518354008994240482915254186001225691837201407244641556100665654134700311440503454914579578322988897030296403705253013030240425314170276324542186929853440167150195734242472842029798051876839218023694971818866179030774694381722849809203226358554462257075547987853989971279198759455911150377220806768520043019343563369383924297628343713389132011187097644750889728058214066854106251841856730816010027578080074991956856259721740611201807109150298939419064511338703147526835965006e-27
```

So that is right to every digit. It does seem like the derivative is more slowly converging, though, so we need to be cautious with that termination criteria? So the next question then is how that translates to the sequence transform... I'd be interested to see if we can separate the derivative partial sums from the actual transform itself, so we can feed in the right partial sums and see if that helps converge the sum. One thing to keep in mind is that the sequence transforms only work well for diverging alternating series. It looks like the derivative will satisfy that, but might have slightly different convergence characteristics?
Ha, I would definitely not suggest just doing finite differences. But yeah, what you're describing about just having a bunch of terms to compensate is exactly what we do there.
Love this problem 😄! Super interesting. The good news, though, is the sequence transform could drastically reduce the number of terms we need, so I think this is still very promising, especially if the derivative takes even more terms. With that said, I think this might be just an issue with how I originally coded the first version. If we take this all the way back (I'm just going to put the general code here):

```julia
function g(s)
    n = length(s) - 1
    l = similar(s)
    N = similar(l)
    D = similar(N)
    for k in 1:n
        w = s[k+1] - s[k]
        N[k], D[k] = s[k] / w, 1.0 / w
        kp1 = k + 1.0
        kdkp1 = k/kp1
        kdkp1jm2 = 1/kdkp1
        for j in 1:k-1
            cst = -(kp1-j)*kdkp1jm2/kp1
            N[k-j] = muladd(cst,N[k-j],N[k-j+1])
            D[k-j] = muladd(cst,D[k-j],D[k-j+1])
            kdkp1jm2 *= kdkp1
        end
        l[k] = N[1]/D[1]
    end
    return l
end
```
```julia
function _besselk_large_argument(v, x::T, iter) where T
    MaxIter = 50
    S = eltype(x)
    v, x = S(v), S(x)
    fv2 = 4 * v^2
    term = one(S)
    res = term
    s = term
    for i in 1:iter
        offset = muladd(2, i, -1)
        term *= muladd(offset, -offset, fv2) / (8 * x * i)
        res = muladd(term, s, res)
        #abs(term) <= eps(T) && break
    end
    return res * sqrt(pi / 2x)
end
```

Now what we need to realize is that the sequence transform can handle any sequence, so we don't necessarily need to take the derivative of the sequence transform: we only need to take the derivative of the sequence, then feed that sequence into the transform. I'm hoping that makes sense. So essentially this is all we need to do:

```julia
v = 0.5; x = 5.0
# generate our partial sums but using the derivative
out = [autodiff(Forward, v -> _besselk_large_argument(v, x, iter), Duplicated, Duplicated(v, 1.0))[2] for iter in 1:30];
```

Then do the sequence transform on those derivative partial sums:

```julia
julia> g(out)
30-element Vector{Float64}:
 0.05604991216397929
 0.051379086150314356
 0.05131596687985941
 0.0513217030639802
 0.051321108923874925
 0.051321175118541326
 0.05132116743538506
 0.051321168343865145
 0.051321168236451525
 0.05132116824892151
 0.051321168247529726
 0.051321168247674714
 0.051321168247661336
 0.051321168247662294
 0.05132116824766225
 0.05132116824766225
 0.05132116824766225
 0.05132116824766225
 0.051321168247662266
 0.05132116824766226
 0.05132116824766225
 0.05132116824766226
 0.05132116824766225
 0.05132116824766226
 0.05132116824766225
 0.051321168247662266
 0.05132116824766226
 0.05132116824766225
 0.051321168247662266

julia> dvbesselk(ArbFloat(v), ArbFloat(x))[2] * exp(ArbFloat(x))
0.05132116824766225501961245911721947725432532303036730970326023167364314434751562772143246443214094748722
```

So as you can see, this only needs 14 terms to give us a derivative accurate to every digit!

Edit: and only about 18-20 terms for a smaller argument:

```julia
julia> v = 0.5; x = 2.5

julia> out = [autodiff(Forward, v -> _besselk_large_argument(v, x, iter), Duplicated, Duplicated(v, 1.0))[2] for iter in 1:30];

julia> dvbesselk(ArbFloat(v), ArbFloat(x))[2] * exp(ArbFloat(x))
0.135087772677340579105023651731665921186421969870307985751490770136161243722610767275174444024496555435028836619099605088362489867464669721497571410861968552056029631007269326971586330043379328447709022231407430109494818834510767049995833607446153636099383460123617983835365629621156133602351560255916340784931099706116063704942997261114836463603880844812376501562359483149734726825274119823214982811149094563989864455340693045374377573412862809968997866980601751064617778251759629803335674838979665361128354655316750064362682923399534613558675702730058772076161233575816259345806490530088060237

julia> g(out)[1:20]
20-element Vector{Float64}:
 0.15853309190424045
 0.1358855073464918
 0.13498857330460076
 0.13510091101392382
 0.1350860461979384
 0.135087985897801
 0.1350877496735732
 0.13508777449654483
 0.13508777266063987
 0.13508777264824284
 0.1350877726838682
 0.135087772676591
 0.1350877726773422
 0.13508777267736022
 0.13508777267733627
 0.1350877726773409
 0.13508777267734065
 0.13508777267734057
 0.1350877726773406
 0.13508777267734062
```
Wow, okay, very nice! So we really just need to somehow get those derivatives evaluated before the acceleration computes the weights.
Yes, I think there are two things about this problem:

1. Generating the right partial sums to feed into the transform (the derivative of the sequence, as above).
2. The termination criteria when AD is used.

To me the second point has to be some general problem with AD itself for iterative processes. I'm wondering if this is already solved? I'm not as familiar with AD, honestly, but the problem could be generally stated as: for an iterative sum with a convergence check, the primal value can converge before its derivative does. This issue has to arise frequently? I would mostly expect the sum of derivatives to be slower converging (at least require a couple more higher-order terms). Is it possible to have two separate stopping criteria when AD is used?
Can the Implicit Function theorem be used here?
Okay, so after looking at the custom rules stuff, here is what I have:

```julia
using Bessels, Enzyme
using Enzyme.EnzymeRules
using Base.Cartesian: @nexprs, @ntuple

@inline levin_scale(B::T, n, k) where T = -(B + n + k) * (B + n + k - 1) / ((B + n + 2k) * (B + n + 2k - 1))

@generated function levin_transform(s::NTuple{N, Float64}) where N
    len = N - 1
    len2 = N - 2
    :(
        begin
            @nexprs $len i -> b_{i} = 1 / (s[i + 1] - s[i])
            @nexprs $len i -> a_{i} = s[i] * b_{i}
            @nexprs $len2 k -> (@nexprs ($len-k) i -> (a_{i}, b_{i}) = (muladd(a_{i}, levin_scale(1.0, i, k-1), a_{i+1}), muladd(b_{i}, levin_scale(1.0, i, k-1), b_{i+1})))
            return a_1 / b_1
        end
    )
end

# Here is the sauce: a manual rule that separately does the levin transform for
# the values and dvalues. This is because if v+1/2 is an integer, the
# sequences of _values_ will be exactly converged and you'll have a divide by
# zero NaN problem, but the _dvalues_ will NOT do that, and you want all those
# terms for the sake of accuracy.
function EnzymeRules.forward(func::Const{typeof(levin_transform)},
                             ::Type{<:Duplicated},
                             s::Duplicated)
    sv = s.val
    dv = s.dval
    N  = length(sv)
    ls  = (sv[N-1] == sv[N]) ? sv[N] : levin_transform(sv)
    dls = (dv[N-1] == dv[N]) ? dv[N] : levin_transform(s.dval)
    Duplicated(ls, dls)
end

@generated function besselkx_as(v, x::T, ::Val{N}) where {T, N}
    quote
        # generate a sequence of partial sums and weights of the asymptotic
        # expansion for large argument besselk
        tup = let (out, term, fv2) = (one(typeof(x)), one(typeof(x)), 4*v^2)
            @ntuple $N i -> begin
                offset = muladd(2, i, -1)
                term *= muladd(offset, -offset, fv2) / (8 * x * i)
                out += term
                out
            end
        end
    end
end

@generated function besselkx_levin(v, x::T, ::Val{N}) where {T, N}
    :(levin_transform(besselkx_as(v, x, Val($N)))*sqrt(pi/2x))
end

# This allocates, so there may still be something to chase down.
dbeskx_dv(v, x) = autodiff(Forward, _v->besselkx_levin(_v, x, Val(20)),
                           Duplicated, Duplicated(v, 1.0))[2]

#=
# A reference so that you can check that the half-integer derivatives work.
#
using FiniteDifferences
dbeskx_dv_ref(v, x) = central_fdm(10,1)(_v->besselkx(_v, x), v)
=#
```

And you can check that the derivatives agree with the finite diff incredibly well, even for half-integer orders. So, @heltonmc: at some point, can you elaborate to me on why you have it set up the way you do?
This is awesome. Should we maybe split the particular Enzyme stuff into a separate issue to keep track of? Just to keep everyone in the loop, I think we are going to have to go the route of custom rules for some functions. That might be the only way to get great performance and accuracy. The plan was to perhaps have this as a separate module that lives within Bessels and use a weak dep. and package extension, or perhaps move it into its own separate package like BesselsAD that focuses on being compatible with a larger AD framework.

@cgeoga I just pushed recent changes to the besselk_levin code over at #95. I'll paste the new functions below.

```julia
@generated function besselkx_levin(v, x::T, ::Val{N}) where {T <: FloatTypes, N}
    :(
        begin
            s = zero(T)
            t = one(T)
            l = @ntuple $N i -> begin
                s += t
                t *= (4*v^2 - (2i - 1)^2) / (8 * x * i)
                invterm = inv(t)
                Vec{2, T}((s * invterm, invterm))
            end
            return levin_transform(l) * sqrt(π / 2x)
        end
    )
end

@generated function besselkx_levin(v, x::Complex{T}, ::Val{N}) where {T <: FloatTypes, N}
    :(
        begin
            s = zero(typeof(x))
            t = one(typeof(x))
            t2 = t
            a = @fastmath inv(8*x)
            a2 = 8*x
            l = @ntuple $N i -> begin
                s += t
                b = (4*v^2 - (2i - 1)^2) / i
                t *= a * b
                t2 *= a2 / b
                Vec{4, T}((reim(s * t2)..., reim(t2)...))
            end
            return levin_transform(l) * sqrt(π / 2x)
        end
    )
end
```

These functions take advantage of constant propagation. They were a little faster, but I was mostly just trying to improve the code clarity a little bit. I also refactored the complex code. The issue with the complex function is that you need to desperately avoid any complex divisions, so the different accumulators are there to make sure we only perform real divisions, or at worst a complex divided by a real, and always avoid a real divided by a complex or a complex divided by a complex. Even doing like ten extra multiplies is worth this cost. If you could try it with this new code on that other branch, that would be great!
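As a standalone illustration of the two-accumulator trick (simplified from the above; this is not the SIMDMath version, and `terms_and_inverses` is a name invented here): `t` carries the running term and `t2` its reciprocal, so after one complex inversion hoisted out of the loop, every division inside the loop is complex-by-real:

```julia
# carry the term and 1/term side by side so the loop never divides by a complex number
function terms_and_inverses(v, x::ComplexF64, N)
    t  = one(x)        # running term of the asymptotic series
    t2 = one(x)        # running reciprocal of the term
    a  = inv(8x)       # the single complex division, done once up front
    a2 = 8x
    pairs = NTuple{2, ComplexF64}[]
    for i in 1:N
        b = (4v^2 - (2i - 1)^2) / i   # purely real factor
        t  *= a * b                   # complex * real multiply
        t2 *= a2 / b                  # complex / real divide
        push!(pairs, (t, t2))
    end
    return pairs
end

# t2 really is 1/t at every step (up to roundoff)
pairs = terms_and_inverses(0.2, 4.3 + 1.1im, 10)
all(p -> abs(p[1] * p[2] - 1) < 1e-12, pairs)
```

Since `t * t2` stays at 1 to roundoff, the reciprocal accumulator gives the Levin weights without ever forming `inv` of a complex term in the loop.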
Very nice. Now that I understand how to do the custom rules, I can adapt them to this. Also, good point about a separate issue. I'll start a new one and use that method as the top post on it when it is done.
Perfect!! Ya, I think above in this thread I showed that really you just have to take the derivative of the sequences and feed them into the sequence transformation, and you should get good derivatives. I would actually imagine that you would be able to get derivatives in the same time as a scalar evaluation.

One thing I just want to note here is why I didn't use the ComplexVec type in SIMDMath (considering I spent all that time developing it :) ) for the complex implementation. Only two complex numbers at a time are able to utilize SIMD that way, whereas the alternating storage approach allows the operations to fit in a 4 x double SIMD operation. This, coupled with only needing complex-real multiplies and complex-complex addition within the Levin loop, means that this is a really special case where the alternating storage approach is superior. I should make a note of that in the files somewhere.
@cgeoga I was thinking some more about how we compute the region between asymptotic expansions (power series, uniform expansions, and large argument), which corresponds to roughly `0 < nu < 25` and `2 < x < 30`. Right now, we are using the wronskian relation to `besseli`. The reason we can do this is that the power series of `besseli` is convergent in this zone, so using that along with the continued fraction and wronskian we can get `besselk` pretty accurately. Now there are a couple drawbacks to that:

1. We need to compute both `besseli(nu, x)` and `besseli(nu + 1, x)`, and then also need to compute `besselk(nu+1,x)/besselk(nu,x)`.
2. The power series gets slower as `x` gets bigger.
3. `besseli(nu, x)` suffers from cancellation for complex numbers, so it can't be as uniformly used.

The first point was taken care of by using SIMD instructions to compute both `besseli(nu, x)` and `besseli(nu + 1, x)`, so that is pretty optimized (though we pay an extra penalty for complex numbers). But points 2 and 3 are pretty much unsolvable: power series will always be slow for larger arguments and can't be used for much with complex numbers.

I was toying with an alternative which links a few ideas. Now, the continued fraction approach to compute `besselk(nu+1,x)/besselk(nu,x)` converges more quickly as `x` gets bigger, so it has an inverse relation to the power series (i.e., the power series converges faster for small `x` while the continued fraction converges faster for big `x`). So what if there was a good way to use the uniform expansions to link with the smaller orders? Now, Miller's algorithm links stable backward recurrence to higher orders with a normalization factor calculated from low orders. That is fine, but usually we do not know the lower orders except for integer arguments.

I'm sure this idea has been described before, but I can't seem to find a description of it. There appears to be a way to link stable forward recurrence with a normalization factor calculated at higher orders to compute `besselk` at lower orders. It works by, instead of considering trial start values (0 and epsilon) as in the Miller backwards scheme, using the continued fraction to generate start values in the forward scheme. So the algorithm would look something like:

1. Use the continued fraction to get the ratio `knup1/knu`, and take `1.0` and `knup1/knu` as the start values.
2. Forward recur up to a higher order (`nu_ceil`).
3. Compute `besselk(nu_ceil, x)` using the uniform asymptotic expansion.
4. Normalize the recurred values with the value at `nu_ceil` to recover `besselk(nu, x)`.

It may seem complicated, but instead of having to compute `besseli(nu, x)`, `besseli(nu + 1, x)`, and `besselk(nu+1,x)/besselk(nu,x)`, we just need `besselk(nu_ceil, x)` and `besselk(nu+1,x)/besselk(nu,x)`. There are a few extra steps of course to generate the recurrence, but that is pretty small.

So now we have an algorithm that computes `besselk(nu_ceil, x)` in constant time using the uniform asymptotic expansion, and an algorithm to compute `besselk(nu+1,x)/besselk(nu,x)` which converges faster as `x` increases. A sample implementation of the algorithm would look like..

Speed tests against the other approach: it is about 40% faster for bigger arguments, and for smallish arguments it's about 10% faster. For much smaller arguments it's hard to beat the power series, so it is slower there.

So there are of course things to consider, but this approach has the advantage of being more applicable to a large parameter set (including complex numbers) and has the potential for large speed gains over a fairly big domain. My initial implementation could also probably be improved to speed things up a bit, but it does serve as a good way to link directly to the asymptotic expansions.
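To make the forward-recurrence scheme sketched in this issue concrete, here is a rough standalone sketch of the four steps. All names here are hypothetical, not the Bessels.jl implementation, and `kv` is a crude midpoint-rule evaluation of the integral representation $K_v(x) = \int_0^\infty e^{-x\cosh t}\cosh(vt)\,dt$ standing in for both the continued-fraction ratio and the uniform asymptotic expansion used in the real code:

```julia
# stand-in for besselk: integral representation, midpoint rule (not production code)
function kv(v, x)
    h = 1e-3
    s = 0.0
    t = h / 2
    while t < 8.0                 # integrand is negligible well before t = 8 for x >= 2
        s += exp(-x * cosh(t)) * cosh(v * t)
        t += h
    end
    return s * h
end

# steps 1-4 above: start values (1, K_{v+1}/K_v), stable forward recurrence
# K_{n+1}(x) = (2n/x) K_n(x) + K_{n-1}(x), then normalize at nu_ceil
function besselk_forward_recurrence(v, x, nu_ceil)
    knu, knup1 = 1.0, kv(v + 1, x) / kv(v, x)  # ratio: continued fraction in the real code
    n = v + 1
    while n < nu_ceil - 0.5                    # recur up through order nu_ceil
        knu, knup1 = knup1, (2n / x) * knup1 + knu
        n += 1
    end
    # knup1 is now K_{nu_ceil}(x)/K_v(x), so normalizing with the
    # (asymptotic-expansion) value at nu_ceil recovers K_v(x)
    return kv(nu_ceil, x) / knup1
end
```

Because the recurrence is linear and the start ratio is exact, the recurred value is exactly proportional to `besselk(nu_ceil, x)`, and forward recurrence for `besselk` is the stable (growing) direction; e.g. `besselk_forward_recurrence(0.3, 10.0, 10.3)` matches `kv(0.3, 10.0)` to several digits here, limited only by the crude quadrature.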