A different approach to compute besselk in intermediate zones #74

Open
heltonmc opened this issue Dec 8, 2022 · 33 comments

@heltonmc
Member

heltonmc commented Dec 8, 2022

@cgeoga I was thinking some more about how we compute the region between the asymptotic expansions (power series, uniform expansions, and large argument), which corresponds to roughly 0 < nu < 25 and 2 < x < 30...

Right now, we are using the Wronskian relation to besseli. The reason we can do this is that the power series of besseli is convergent in this zone, so using it along with the continued fraction and the Wronskian we can get besselk pretty accurately. Now there are a couple of drawbacks to that:

  1. We need besseli(nu, x) and besseli(nu + 1, x) and then also need to compute besselk(nu+1,x)/besselk(nu,x).
  2. Power series are slower to converge as x gets bigger
  3. The power series of besseli(nu, x) suffers from cancellation for complex numbers so can't be as uniformly used.

The first point was taken care of by using SIMD instructions to compute besseli(nu, x) and besseli(nu + 1, x) together, so that is pretty optimized (though we pay an extra penalty for complex numbers). But points 2 and 3 are pretty much unsolvable: power series will always be slow for larger arguments and can't be used for much of the complex plane.
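For concreteness, here is a minimal sketch of that Wronskian route (besselk_via_wronskian is a made-up name, not a Bessels.jl function; besselk_ratio_knu_knup1 is the internal continued-fraction routine used throughout this thread):

using Bessels

# Sketch only: solve the Wronskian identity
#   besseli(v, x) * besselk(v+1, x) + besseli(v+1, x) * besselk(v, x) = 1 / x
# for besselk(v, x) given the continued-fraction ratio besselk(v+1,x)/besselk(v,x).
function besselk_via_wronskian(v, x)
    inu, inup1 = Bessels.besseli(v, x), Bessels.besseli(v + 1, x)  # power series zone
    ratio = Bessels.besselk_ratio_knu_knup1(v, x)                  # besselk(v+1,x)/besselk(v,x)
    return inv(x * muladd(inu, ratio, inup1))
end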

I was toying with an alternative which links a few ideas. The continued fraction approach to compute besselk(nu+1,x)/besselk(nu,x) converges more quickly as x gets bigger, so it has an inverse relation to the power series (i.e., the power series converges faster for small x while the continued fraction converges faster for big x). So what if there was a good way to use the uniform expansions to link back to the smaller orders? Miller's algorithm links stable backward recurrence to higher orders with a normalization factor calculated from low orders. That is fine, but usually we do not know the lower orders except for integer arguments.

I'm sure this idea has been described before, but I can't seem to find a description of it. There appears to be a way to link stable forward recurrence with a normalization factor calculated at higher orders to compute besselk at lower orders. Instead of the trial start values (0 and epsilon) of the Miller backward scheme, it uses the continued fraction approach to generate start values for the forward scheme. So the algorithm would look something like:

  1. Generate start values 1.0 and knup1/knu.
  2. Use upward recurrence up to an order where the asymptotic expansion can be used (nu_ceil).
  3. Compute besselk(nu_ceil, x) using the asymptotic expansion.
  4. Normalize: scale by besselk(nu_ceil, x) divided by the recurrence value at nu_ceil.

It may seem complicated, but instead of having to compute besseli(nu, x), besseli(nu + 1, x), and besselk(nu+1,x)/besselk(nu,x), we just need besselk(nu_ceil, x) and besselk(nu+1,x)/besselk(nu,x). There are a few extra steps of course to generate the recurrence, but that cost is pretty small.

So now we have an algorithm that computes besselk(nu_ceil, x) in constant time using the uniform asymptotic expansion, and an algorithm to compute besselk(nu+1,x)/besselk(nu,x) which converges faster as x increases. A sample implementation of the algorithm would look like..

using Bessels

function test(nu, x)
    # keep the fractional part so forward recurrence lands exactly on nu_ceil
    fpart, _ = modf(nu)
    nu_ceil = 25.0 + fpart
    # constant-time evaluation at the large order via the uniform expansion
    knu_ceil = Bessels.besselk_large_orders(nu_ceil, x)
    # continued fraction for the ratio besselk(nu+1, x) / besselk(nu, x)
    knup1 = Bessels.besselk_ratio_knu_knup1(nu, x)
    # recurse up from the trial values (1.0, knup1) and normalize
    return knu_ceil / Bessels.besselk_up_recurrence(x, knup1, 1.0, nu + 1, nu_ceil)[1]
end

Speed tests against the other approach.

## high precision reference value for besselk(2.2, 30.0)
julia> 2.308550810254649099632E-14
2.308550810254649e-14

julia> test(2.2, 30.0)
2.3085508102546522e-14

julia> Bessels.besselk_continued_fraction(2.2, 30.0)
2.3085508102546462e-14

julia> @benchmark test(2.2, x) setup = (x = 30.0 + rand())
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
 Range (min … max):  79.846 ns … 120.610 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     79.976 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   80.832 ns ±   2.255 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▃                    ▂▃           ▃▅▁                      ▁
  ███▆▅▅▄▃▄▄▄▄▆▄▄▃▁▁▄▃▄▁▆██▆▅▃▃▃▄▄▄▅▄▆███▅▄▅▃▃▁▃▁▄▃▁▄▃▃▁▃▃▁▁▃█ █
  79.8 ns       Histogram: log(frequency) by time      87.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Bessels.besselk_continued_fraction(2.2, x) setup = (x = 30.0 + rand())
BenchmarkTools.Trial: 10000 samples with 898 evaluations.
 Range (min … max):  127.180 ns … 178.360 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     129.315 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   128.840 ns ±   1.969 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▇▂              █▇                                           
  ▃███▃▂▂▂▂▂▂▂▂▁▂▂▂███▄▃▂▂▂▂▂▃▃▃▄▃▂▂▂▂▂▂▁▂▂▂▂▃▃▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂ ▃
  127 ns           Histogram: frequency by time          134 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Which is about 40% faster for bigger arguments. For smallish arguments it's about 10% faster.

julia> @benchmark Bessels.besselk_continued_fraction(2.2, x) setup = (x = 10.0 + rand())
BenchmarkTools.Trial: 10000 samples with 940 evaluations.
 Range (min … max):  102.305 ns … 145.612 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     104.122 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   104.488 ns ±   2.483 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▆     ▄█▅▂   ▁    ▅▄▃▁     ▁                                 ▂
  ██▇▁▁▁▁████▆▄▄█▇▅▅▄████▇▅▅▄▇█▇▆▆▇▇▅▄▁▃▄▅▅▆▅▆▅▆▆▅▄▅▄▄▃▄▄▅▅▃▄▄▄ █
  102 ns        Histogram: log(frequency) by time        115 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test(2.2, x) setup = (x = 10.0 + rand())
BenchmarkTools.Trial: 10000 samples with 954 evaluations.
 Range (min … max):  93.203 ns … 134.565 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     97.615 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   96.402 ns ±   2.687 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▅              ▁       █▅              ▁         ▁▃         ▂
  ██▇▅▄▅▄▅▅▄▄▁▁▃▄▅█▆▅▄▁▁▁███▇▅▅▅▅▇▆▆▅▁▄▁▃▇█▇▆▅▅▄▃▅▁▅██▆▃▄▁▄▁▄▃ █
  93.2 ns       Histogram: log(frequency) by time       104 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

And for much smaller arguments it's hard to beat the power series, so this approach is slower there.

julia> @benchmark test(2.2, x) setup = (x = 4.0 + rand())
BenchmarkTools.Trial: 10000 samples with 918 evaluations.
 Range (min … max):  115.423 ns … 157.588 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     121.642 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   122.518 ns ±   4.119 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆              █             ▂█              ▆              ▃ ▂
  █▇▃▄▄▆▃▃▁▁▃▁▁▁▇██▅▃▆█▆▄▄▄▃▃▄▁███▄▁▅█▆▄▃▄▁▃▃▄▅██▃▄▄▅▅▃▁▃▁▄▁▃▁█ █
  115 ns        Histogram: log(frequency) by time        133 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Bessels.besselk_continued_fraction(2.2, x) setup = (x = 4.0 + rand())
BenchmarkTools.Trial: 10000 samples with 942 evaluations.
 Range (min … max):  100.849 ns … 158.882 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     108.678 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   108.736 ns ±   4.641 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄    ▅      █▂  ▅▂  ▂▆  ▇▄   ▃  ▂▅  ▅▃   ▁   ▃   ▃            ▂
  █▄▁▄▇█▅▁▆▃▃▄██▆▆██▄▄███▅██▆▆██▆▄██▇▆██▅▆▅█▆▅██▇▅▆█▅▅▄▇▄▄▄██▁█ █
  101 ns        Histogram: log(frequency) by time        123 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So there are of course things to consider, but this approach has the advantage of being applicable to a larger parameter set (including complex numbers) and has the potential for large speed gains over a fairly big domain. My initial implementation could also probably be improved to speed things up a bit, but it does serve as a good way to link directly to the asymptotic expansions.

@cgeoga
Contributor

cgeoga commented Jan 5, 2023

@heltonmc, apologies---I somehow completely missed this when you posted it and am just seeing it now for the first time. I like this idea a lot and tried to think about some similar ideas to leverage the large-order expansions for these intermediate argument ranges, but I never really got anywhere. This is very nice. I didn't realize that the power series cutoff was 2---for some reason I thought it was 4. But even by 3 it seems to draw with the power series, and so to me it seems like a pretty obvious win to give up ~10ns for 2 < x < 3 but then gain an increasingly substantial amount when 3 < x < 30.

I think I honestly need a little more time to grok why this works at all, though. I never really understood the Miller recursion at a very deep level, and this whole concept of normalization is still a little shaky to me. But I'll definitely play with this and try to understand it, and it seems like a seriously exciting option.

@heltonmc
Member Author

heltonmc commented Jan 5, 2023

Not a problem!! I was working on this before I needed to switch to dissertation stuff so haven't come back to it. Hopefully will work on it a little more in the next couple of weeks.

Yes - the power series for besseli is pretty optimized and rapidly convergent for small arguments, so it is hard to beat for x<3. But I was also thinking about the idea of using this approach for bessely, because you can't use the besselj power series the same way you can for besseli. So it is much more applicable to that function, but it also helps here when the besseli series is slowly converging (larger arguments) or prone to severe cancellation (complex numbers).

I also like the idea of reducing the number of algorithms employed: if the uniform expansion is the main driver of the speed and accuracy for almost all arguments, that is more easily optimized than several different routines. But ya, the Miller algorithm seems to be a pretty common way to generate higher (integer) order functions when forward recurrence is not stable; that way you can use an optimized zero-order function to normalize for any order (a sketch is below). This doesn't apply as directly to besselk and bessely, though, where forward recurrence is stable..
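For context, a minimal sketch of the classical Miller idea for besseli (purely illustrative; besseli_miller is a made-up name and the starting offset is arbitrary):

using Bessels

# Miller's scheme (sketch): backward recurrence from arbitrary trial values at
# a high order, then normalize with an accurate zero-order routine.
function besseli_miller(n::Int, x; nstart = n + 20)
    ip1, ik = 0.0, 1e-30             # trial values for I_{nstart+1} and I_{nstart}
    target = 0.0
    for k in nstart:-1:1
        # downward recurrence: I_{k-1}(x) = (2k / x) * I_k(x) + I_{k+1}(x)
        ip1, ik = ik, muladd(2k / x, ik, ip1)
        k - 1 == n && (target = ik)  # save the unnormalized value at order n
    end
    # ik now holds the unnormalized I_0(x); rescale with an accurate besseli0
    return target * Bessels.besseli0(x) / ik
end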

Anyway, we still need the continued fraction approach for bessely..... that could significantly reduce the amount of code and compilation cost (though that seems to be less of an issue now)

@heltonmc
Member Author

heltonmc commented Jan 13, 2023

Was thinking about this approach for bessely again.... it might be possible to derive the continued fraction for bessely directly, but I think it's also possible to use the relation between besselk and the Hankel function. Then the relation of the Hankel function gives bessely from besselk, so we can just use the same approach described above with besselk_ratio_knu_knup1 after some modifications.

The first challenge is that it requires a rotation of argument by pi/2 so real arguments for bessely will require evaluation of besselk along imaginary axis. That's not too big of a deal for the continued fraction approach other than slowing it down slightly. I made some adjustments to the existing approach here to speed it up slightly for complex arguments....

function besselk_ratio_knu_knup1(v, x::ComplexOrReal{T}) where T
    MaxIter = 1000
    S = eltype(x)
    # modified Lentz's method for the continued fraction; the tiny nonzero
    # starting values guard against division by zero
    (hn, Dn, Cn) = (S(1e-50), zero(S), S(1e-50))

    jf = one(T)
    vv = v * v
    @fastmath begin
        for _ in 1:MaxIter
            an = vv - (jf - 0.5)^2
            bn = 2 * (x + jf)
            Cn = an / Cn + bn
            Dn = inv(muladd(an, Dn, bn))
            del = Dn * Cn
            hn *= del
            checktol(del - 1) && break
            jf += one(T)
        end
        # assemble the ratio besselk(v+1, x) / besselk(v, x)
        xinv = (v + x + hn + 1/2) / x
    end
    return xinv
end
@inline checktol(err::T) where T = abs(err) < eps(T)
@inline checktol(err::Complex{T}) where T = abs2(err) < eps(T)^2

Perhaps it could be sped up further, but it is already pretty fast for the input range we need for bessely (7 < x < 20)...

julia> @btime besselk_ratio_knu_knup1(v, x) setup=(x= 20.0im; v = rand()*1)
  76.246 ns (0 allocations: 0 bytes)
1.0000915788150249 - 0.045999085366594im

julia> @btime besselk_ratio_knu_knup1(v, x) setup=(x= 7.0im; v = rand()*1)
  141.996 ns (0 allocations: 0 bytes)
1.001602364037009 - 0.11366618431295909im

So that's a range between 76-142 nanoseconds in the regime we'd utilize this for, even with complex arithmetic... Of course we also need the recurrence up to the higher order expansion (around 20 nanoseconds) and then one large order evaluation...

julia> @btime Bessels.besseljy_debye(y, x) setup=(x=10*rand()+12.0; y=rand()+60.0)
  49.393 ns (0 allocations: 0 bytes)
(1.8133319226273097e-26, -3.0253831328694315e23)

Which takes around 50 nanoseconds. So that puts the total time for this approach around 150-225 nanoseconds in a region where no expansion gives good results. That might not seem fast, but the existing approach using the Chebyshev approximation is the slowest routine we have...

julia> @btime Bessels.bessely_chebyshev(y, x) setup=(x=12*rand()+6; y=rand()*20.0)
  460.577 ns (0 allocations: 0 bytes)

Which takes between 460-500 nanoseconds so the new approach would be about 2-3x faster.

I think it might be worth taking a look at the whole routine to some degree and trying to decrease the number of algorithms to improve branch prediction.

However, there was some discussion on Discourse about improving the Chebyshev evaluation using SIMD.... I'd assume it would be possible to improve those times by 2-3x.

Edit

The relation between besselk and bessely through the Hankel function might require a bit more care. It's true that besselk(v+1, z) / besselk(v, z) = im * besselh1(v+1, z*im) / besselh1(v, z*im), but relating that easily to bessely(v+1, z) / bessely(v, z) isn't as straightforward without computing besselj.

So we could compute besselh1 this way, as we have the higher order expansion to compute besselh1 with large orders. The challenge, though, is that forward recurrence is only stable in the imaginary part of besselh1, not the real part. I believe we would actually need both parts for the normalization condition, so this wouldn't necessarily work...

Using the relation bessely(v, x) = -cispi(-v/2) * (conj(besselk(v, x*im)) + cispi(v) * besselk(v, im*x)) / pi might be helpful here if we can calculate besselk. So maybe it is worth starting with computing besselk in the complex plane... The issue is that along the imaginary axis the Debye expansion doesn't converge well. Will need to think more about that.
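That relation can at least be sanity-checked against SpecialFunctions.jl (just a verification sketch, not Bessels.jl code; for real v and x the expression is real up to roundoff, so taking the real part is safe):

using SpecialFunctions

# check the quoted bessely/besselk relation numerically
bessely_via_besselk(v, x) = real(-cispi(-v / 2) * (conj(besselk(v, x * im)) + cispi(v) * besselk(v, im * x)) / pi)

bessely_via_besselk(0.7, 9.0) ≈ bessely(0.7, 9.0)  # true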

Edit 2

After some more consideration, it might be best to compute the Hankel function directly and then just take bessely from that.
We can use the relation besselk(v+1, z) / besselk(v, z) = im * hankelh1(v+1, z * im) / hankelh1(v, z * im) and use the continued fraction approach for besselk.

Similar to what I showed above, we can construct the Hankel function from the continued fraction plus one higher order evaluation.

# computes the Hankel function hankelh1(nu, x)
# besselj(nu, x) is the real part; bessely(nu, x) is the imaginary part
function test(nu, x)
    fpart, _ = modf(nu)
    nu_ceil = 45.0 + fpart
    knu_ceil = complex(Bessels.besseljy_debye(nu_ceil, x)...)
    knup1 = -im * conj(Bessels.besselk_ratio_knu_knup1(nu, x*im))
    return knu_ceil / Bessels.besselj_up_recurrence(x, knup1, one(knup1), nu + 1, nu_ceil)[1]
end

julia> test(2.5, 18.0)
0.11922846888645108 + 0.14657028641952727im

julia> besselj(2.5, 18.0)
0.11922846888645013

julia> bessely(2.5, 18.0)
0.1465702864195262

Even with the code base as is this is pretty fast.

julia> @benchmark test(2.5, x) setup=(x=6.0 + 10.0*rand())
BenchmarkTools.Trial: 10000 samples with 690 evaluations.
 Range (min … max):  181.341 ns … 285.023 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     183.636 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   189.049 ns ±  10.150 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▄█  ▃▁▃              ▄▆▁▆  ▂ ▁   ▃▁▃                        ▂
  ████▆▆█████▇▇▄▁▄▄▃▄▁▅▁▄████▇▆█▇█▇▆▆███▇▄▅▆▆▅▄▅▄▃▄▁▄▁▃▁▁█▇▄███ █
  181 ns        Histogram: log(frequency) by time        225 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So that's much faster than the current approach. I think some more care will help the convergence by only doing continued fraction for small orders and scaling from there.

The question to me is the stability of the recurrence. For bessely I'm not as worried because the forward recurrence is stable, but this just needs some more testing. It is pretty nice, though, because it again depends on the accuracy and speed of the same expansion that much of the routine does, so improving that will improve the whole routine....

@heltonmc
Member Author

@cgeoga To continue this discussion a little bit: I think it would be really nice to round out the modified Bessel functions completely. I've also been thinking a lot about how to ensure the implementations are differentiable. Enzyme has come a long way on its main branch and can already give pretty fast derivatives of the code in Bessels.jl.

Anyway, for computation of the modified Bessel functions (real arguments for now) there are really three main approaches, in order of precedence:

  1. large argument expansion
  2. power series
  3. uniform asymptotic expansion for large orders

Now the power series for besseli is always numerically stable. With some of the recent SIMD work, we should probably unroll this loop by a factor of four and use SIMDMath.jl to make sure it vectorizes; that will make computation of the power series much more efficient. But where all of those methods fail (let's call it the intermediate zone) we need a good fourth algorithm. At first it doesn't seem like there is a great method, but this thread mentions two:

  1. The besseli power series coupled with the continued fraction for besselk(v+1)/besselk(v) and the Wronskian.
  2. The slightly modified Miller scheme: forward recurrence from continued-fraction start values, with the normalizing condition from the large-order asymptotic expansion.

Those methods' advantages and disadvantages were extensively discussed above, but essentially method 1 converges faster for smaller arguments while method 2 converges faster for larger arguments. The slight disadvantage of method 2 is that its accuracy is tied to the uniform expansions, which are slightly less accurate (due to roundoff) than the power series. Approach 1, though, is not feasible for complex numbers.

I'm going to introduce a different approach that could be more robust, meaning it will work for complex numbers and be accurate. It was inspired by https://arxiv.org/pdf/1112.0072.pdf when I was working on the Airy functions. Approach three is:

  1. Sequence transformation (Levin's approach) of the asymptotic expansions

At first I ignored these approaches because they honestly just didn't seem worth it and not efficient at all. Essentially, they are able to sum a divergent asymptotic series and extend the region of validity of the asymptotic expansions. Computationally they can be evaluated using recursive schemes, but they require significantly more computation: you must build a Levin table and work down its counter-diagonals. Just thinking theoretically about this problem, that requires at a minimum N*(N+1)/2 muladd operations. I ran some tests, and typically you need about 16-24 terms to get this to work. So for 16 terms you need 136 muladd operations, and for 24 terms you need 300 muladds, at a minimum. Not to mention the sequence transformation is essentially a more advanced Padé approximant, so you must also compute a numerator and a denominator, doubling all those operations.
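For reference (this is my paraphrase of the standard formulas, so treat the exact indexing as a sketch), Levin's transformation of the partial sums $s_n$ with remainder estimates $\omega_n$ is the ratio

$$\mathcal{L}_k^{(n)}(\beta) = \frac{\displaystyle\sum_{j=0}^{k} (-1)^j \binom{k}{j} \frac{(\beta+n+j)^{k-1}}{(\beta+n+k)^{k-1}} \frac{s_{n+j}}{\omega_{n+j}}}{\displaystyle\sum_{j=0}^{k} (-1)^j \binom{k}{j} \frac{(\beta+n+j)^{k-1}}{(\beta+n+k)^{k-1}} \frac{1}{\omega_{n+j}}},$$

and the numerator and denominator each satisfy the same two-term recurrence

$$X_{k+1}^{(n)} = X_k^{(n+1)} - \frac{(\beta+n)(\beta+n+k)^{k-1}}{(\beta+n+k+1)^{k}}\, X_k^{(n)},$$

whose coefficient is the levin_scale factor that shows up in the code later in this thread; filling that triangular table is where the N*(N+1)/2 operation count comes from.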

This is probably why I initially avoided this approach: most of our routines are under 100 ns, and not only do you need all this arithmetic, you also need to store tables, so the cache lines and memory accesses are honestly rubbish. My first attempt at this using arrays took like 10 microseconds, which is awful. But I want to come back to the performance after discussing where this is accurate.

So here are a few plots of the relative errors as a function of x at a specific nu value. I'll then plot the number N of terms used in each sequence transformation.
[Screenshots: relative error vs. x at fixed nu, and the number of terms N used in each sequence transformation]

So the really cool thing here is that we are summing the asymptotic expansion for large x, but the sequence transformation allows us to sum this rapidly diverging series even when x << nu.

Ok, for the benchmarks it's hard to say much, as the general form was so slow. So, to really see the limits of the performance, I completely unrolled a version for N=16 (which is probably not what we want). The code is long (@oscardssmith maybe you have an idea on how to organize the memory most effectively in a loop here 😄 ); I'll work on optimizing the general case in a bit. But the benchmarks are pretty good.

julia> p = ntuple(i -> i*2.0, 18)
(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0)

julia> @benchmark g1($p)
BenchmarkTools.Trial: 10000 samples with 990 evaluations.
 Range (min … max):  45.328 ns … 84.385 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.455 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.635 ns ±  1.419 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▆▄▂                                   ▁                   ▁
  █████▄▁▁▁▁▁▁▃▁▁▁▃▁▄▁▃▃▁▃▅▃▁▁▁▁▃▃▁▄▁▁▁▁▃▇██▇▄▄▁▃▃▁▃▃▄▁▁▄▃▃▅▅ █
  45.3 ns      Histogram: log(frequency) by time      49.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

And the advantage is this isn't awful for complex numbers, because we have only real/complex multiplies and complex/complex additions.

julia> p = ntuple(i -> i*2.0 + rand()*im, 18)
(2.0 + 0.20671806918378222im, 4.0 + 0.5544791253581327im, 6.0 + 0.8743485225845774im, 8.0 + 0.5842694164427652im, 10.0 + 0.5376242211139655im, 12.0 + 0.24931450428480173im, 14.0 + 0.7889137184169184im, 16.0 + 0.18129161121223591im, 18.0 + 0.9320649223876125im, 20.0 + 0.4467818886051006im, 22.0 + 0.44671130589145225im, 24.0 + 0.09229756644375375im, 26.0 + 0.09649719037936122im, 28.0 + 0.7052070903007531im, 30.0 + 0.8544141179769782im, 32.0 + 0.35520748181052775im, 34.0 + 0.22437122283477695im, 36.0 + 0.9058299917597746im)

julia> @benchmark g1($p)
BenchmarkTools.Trial: 10000 samples with 832 evaluations.
 Range (min … max):  147.886 ns … 208.534 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     148.689 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   149.701 ns ±   3.756 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█▆▂     ▁▃▄▂▁▁ ▁▁                                           ▂
  █████▇▇█▇▇██████████▇▆▅▆▆▅▆▆▇▆▅▅▅▃▆▆▆▄▅▄▅▅▄▅▁▅▆▅▃▅▆▆▆▄▆▅▅▅▆▄▅ █
  148 ns        Histogram: log(frequency) by time        168 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

const CC =  ((1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0),
(0.5, 0.6, 0.6666666666666666, 0.7142857142857143, 0.75, 0.7777777777777778, 0.8, 0.8181818181818182, 0.8333333333333334, 0.8461538461538461, 0.8571428571428571, 0.8666666666666667, 0.875, 0.8823529411764706, 0.8888888888888888),
(0.32, 0.4166666666666667, 0.4897959183673469, 0.546875, 0.5925925925925926, 0.63, 0.6611570247933884, 0.6875, 0.7100591715976331, 0.7295918367346939, 0.7466666666666667, 0.76171875, 0.7750865051903114, 0.7870370370370371),
(0.23148148148148148, 0.31486880466472306, 0.3828125, 0.438957475994513, 0.486, 0.5259203606311045, 0.5601851851851852, 0.589895311788803, 0.6158892128279884, 0.6388148148148148, 0.6591796875, 0.6773865255444739, 0.693758573388203),
(0.17992503123698458, 0.251220703125, 0.31214753848498705, 0.3645, 0.4098080732190424, 0.44931520061728397, 0.48401666608312033, 0.514707413577676, 0.5420246913580247, 0.5664825439453125, 0.5884987009255157, 0.6084152568206066),
(0.14654541015625, 0.20809835898999135, 0.26244, 0.3104606615295776, 0.35303337191358025, 0.3909365379902126, 0.4248378651752246, 0.45530074074074073, 0.48279762268066406, 0.5077243694259351, 0.5304133008179648),
(0.12331754606814303, 0.177147, 0.22578957202151098, 0.2696782702117627, 0.30931242566258577, 0.34518076545487, 0.3777309849108368, 0.4073604941368103, 0.4344165727708536, 0.4592004039488861),
(0.1062882, 0.15394743546921202, 0.1977640648219593, 0.237932635125066, 0.27473571128040675, 0.30848030434385004, 0.3394670784473419, 0.3679763910529583, 0.39426297308742747),
(0.09330147604194668, 0.13596279456509702, 0.17570409978466411, 0.21259310991936234, 0.24678424347508002, 0.27846908778883517, 0.30784952976979524, 0.33512352712431337),
(0.08308837445644818, 0.12164129985092131, 0.15792631022581202, 0.19194330048061778, 0.22376980268745683, 0.25352314216336075, 0.28133826968460873),
(0.07485618452364388, 0.10998439462154766, 0.14331766435886129, 0.17482015834957565, 0.20452287098892968, 0.2324948200865864),
(0.06808557762286284, 0.1003223650512029, 0.13111511876218174, 0.1604100948932782, 0.18821009245104614),
(0.06242280492074847, 0.09219031787965903, 0.12077936556670357, 0.14812831350313815),
(0.057618948674786896, 0.08525602275296724, 0.11191917020237105),
(0.053493975060685324, 0.07927607889334616),
(0.049914568192106844,))


function g1(s)
    b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15, b16, b17 = 1 / (s[2] - s[1]), 1 / (s[3] - s[2]), 1 / (s[4] - s[3]), 1 / (s[5] - s[4]), 1 / (s[6] - s[5]), 1 / (s[7] - s[6]), 1 / (s[8] - s[7]), 1 / (s[9] - s[8]), 1 / (s[10] - s[9]), 1 / (s[11] - s[10]), 1 / (s[12] - s[11]), 1 / (s[13] - s[12]), 1 / (s[14] - s[13]), 1 / (s[15] - s[14]), 1 / (s[16] - s[15]), 1 / (s[17] - s[16]), 1 / (s[18] - s[17])
    a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17 = s[1] * b1, s[2] * b2, s[3] * b3, s[4] * b4, s[5] * b5, s[6] * b6, s[7] * b7, s[8] * b8, s[9] * b9, s[10] * b10, s[11] * b11, s[12] * b12, s[13] * b13, s[14] * b14, s[15] * b15, s[16] * b16, s[17] * b17

    index = 1

    a1 = muladd(a1, -CC[index][1], a2)
    b1 = muladd(b1, -CC[index][1], b2)

    a2 = muladd(a2, -CC[index][2], a3)
    b2 = muladd(b2, -CC[index][2], b3)

    a3 = muladd(a3, -CC[index][3], a4)
    b3 = muladd(b3, -CC[index][3], b4)

    a4 = muladd(a4, -CC[index][4], a5)
    b4 = muladd(b4, -CC[index][4], b5)

    a5 = muladd(a5, -CC[index][5], a6)
    b5 = muladd(b5, -CC[index][5], b6)

    a6 = muladd(a6, -CC[index][6], a7)
    b6 = muladd(b6, -CC[index][6], b7)

    a7 = muladd(a7, -CC[index][7], a8)
    b7 = muladd(b7, -CC[index][7], b8)

    a8 = muladd(a8, -CC[index][8], a9)
    b8 = muladd(b8, -CC[index][8], b9)

    a9 = muladd(a9, -CC[index][9], a10)
    b9 = muladd(b9, -CC[index][9], b10)

    a10 = muladd(a10, -CC[index][10], a11)
    b10 = muladd(b10, -CC[index][10], b11)

    a11 = muladd(a11, -CC[index][11], a12)
    b11 = muladd(b11, -CC[index][11], b12)

    a12 = muladd(a12, -CC[index][12], a13)
    b12 = muladd(b12, -CC[index][12], b13)

    a13 = muladd(a13, -CC[index][13], a14)
    b13 = muladd(b13, -CC[index][13], b14)

    a14 = muladd(a14, -CC[index][14], a15)
    b14 = muladd(b14, -CC[index][14], b15)

    a15 = muladd(a15, -CC[index][15], a16)
    b15 = muladd(b15, -CC[index][15], b16)

    a16 = muladd(a16, -CC[index][16], a17)
    b16 = muladd(b16, -CC[index][16], b17)

    index = 2

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    a8 = a9 - a8*CC[index][8]
    b8 = b9 - b8*CC[index][8]

    a9 = a10 - a9*CC[index][9]
    b9 = b10 - b9*CC[index][9]

    a10 = a11 - a10*CC[index][10]
    b10 = b11 - b10*CC[index][10]

    a11 = a12 - a11*CC[index][11]
    b11 = b12 - b11*CC[index][11]

    a12 = a13 - a12*CC[index][12]
    b12 = b13 - b12*CC[index][12]

    a13 = a14 - a13*CC[index][13]
    b13 = b14 - b13*CC[index][13]

    a14 = a15 - a14*CC[index][14]
    b14 = b15 - b14*CC[index][14]

    a15 = a16 - a15*CC[index][15]
    b15 = b16 - b15*CC[index][15]


    index = 3

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    a8 = a9 - a8*CC[index][8]
    b8 = b9 - b8*CC[index][8]

    a9 = a10 - a9*CC[index][9]
    b9 = b10 - b9*CC[index][9]

    a10 = a11 - a10*CC[index][10]
    b10 = b11 - b10*CC[index][10]

    a11 = a12 - a11*CC[index][11]
    b11 = b12 - b11*CC[index][11]

    a12 = a13 - a12*CC[index][12]
    b12 = b13 - b12*CC[index][12]
    
    a13 = a14 - a13*CC[index][13]
    b13 = b14 - b13*CC[index][13]

    a14 = a15 - a14*CC[index][14]
    b14 = b15 - b14*CC[index][14]

    index = 4

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    a8 = a9 - a8*CC[index][8]
    b8 = b9 - b8*CC[index][8]

    a9 = a10 - a9*CC[index][9]
    b9 = b10 - b9*CC[index][9]

    a10 = a11 - a10*CC[index][10]
    b10 = b11 - b10*CC[index][10]

    a11 = a12 - a11*CC[index][11]
    b11 = b12 - b11*CC[index][11]

    a12 = a13 - a12*CC[index][12]
    b12 = b13 - b12*CC[index][12]
    
    a13 = a14 - a13*CC[index][13]
    b13 = b14 - b13*CC[index][13]
    
    index = 5

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    a8 = a9 - a8*CC[index][8]
    b8 = b9 - b8*CC[index][8]

    a9 = a10 - a9*CC[index][9]
    b9 = b10 - b9*CC[index][9]

    a10 = a11 - a10*CC[index][10]
    b10 = b11 - b10*CC[index][10]

    a11 = a12 - a11*CC[index][11]
    b11 = b12 - b11*CC[index][11]

    a12 = a13 - a12*CC[index][12]
    b12 = b13 - b12*CC[index][12]

    index = 6

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    a8 = a9 - a8*CC[index][8]
    b8 = b9 - b8*CC[index][8]

    a9 = a10 - a9*CC[index][9]
    b9 = b10 - b9*CC[index][9]

    a10 = a11 - a10*CC[index][10]
    b10 = b11 - b10*CC[index][10]

    a11 = a12 - a11*CC[index][11]
    b11 = b12 - b11*CC[index][11]


    index = 7

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    a8 = a9 - a8*CC[index][8]
    b8 = b9 - b8*CC[index][8]

    a9 = a10 - a9*CC[index][9]
    b9 = b10 - b9*CC[index][9]

    a10 = a11 - a10*CC[index][10]
    b10 = b11 - b10*CC[index][10]


    index = 8

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    a8 = a9 - a8*CC[index][8]
    b8 = b9 - b8*CC[index][8]

    a9 = a10 - a9*CC[index][9]
    b9 = b10 - b9*CC[index][9]


    index = 9

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    a8 = a9 - a8*CC[index][8]
    b8 = b9 - b8*CC[index][8]

    index = 10

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    a7 = a8 - a7*CC[index][7]
    b7 = b8 - b7*CC[index][7]

    index = 11

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    a6 = a7 - a6*CC[index][6]
    b6 = b7 - b6*CC[index][6]

    index = 12

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    a5 = a6 - a5*CC[index][5]
    b5 = b6 - b5*CC[index][5]

    index = 13

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    a4 = a5 - a4*CC[index][4]
    b4 = b5 - b4*CC[index][4]

    index = 14

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    a3 = a4 - a3*CC[index][3]
    b3 = b4 - b3*CC[index][3]

    index = 15

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    a2 = a3 - a2*CC[index][2]
    b2 = b3 - b2*CC[index][2]

    index = 16

    a1 = a2 - a1*CC[index][1]
    b1 = b2 - b1*CC[index][1]

    return a1 / b1

end

@oscardssmith
Member

Isn't this just 16 pairs of evalpolys of degrees 1 to 16?

@cgeoga
Contributor

cgeoga commented Mar 21, 2023

This is insanity. But I'm into it. Would you mind sharing your un-optimized g1? I would appreciate taking a look at that to get a better idea of what's going on. Will also take a look at this paper---this is super interesting. I haven't actually seen this Levin transformation before at all.

Also, you mention differentiability with Enzyme---that's pretty cool! Would you also mind sharing those snippets/tests? I just tried in a fresh environment pulling the most recent versions and didn't have much luck for any routines that weren't the fixed order variety (besselk0, besselk1, etc).

@heltonmc
Member Author

Isn't this just 16 pairs of evalpolys of degrees 1 to 16?

Yes - but I think the difference here is that you have to store the n-1 values after each iteration, so it is more of a cumsum in this regard; I'll look into this more. The paper that describes this approach is https://arxiv.org/pdf/math/0306302.pdf (section 7); they actually give a Fortran snippet in 7.5, but I didn't look at it much. @cgeoga I'd highly recommend giving that paper a read, it has a lot of very good stuff. @oscardssmith the Levin table is described in 7.5-1, computed with a three-term recurrence. The code I listed above was the fastest way I could come up with for doing that recursive tree type problem (though much of my time was spent just getting this to work).

This is insanity. But I'm into it.

Haha ya me too 😄 The code was originally based on https://github.com/MikaelSlevinsky/SequenceTransformations.jl, and once you peel away all the generated functions and whatnot you get code that looks like...

function g(s)
    n = length(s) - 1
    l = similar(s)   # running transform estimates
    N = similar(l)   # numerator column of the Levin table
    D = similar(N)   # denominator column of the Levin table

    for k in 1:n
        w = s[k+1] - s[k]
        N[k], D[k] = s[k] / w, 1.0 / w
        kp1 = k + 1.0
        kdkp1 = k/kp1
        kdkp1jm2 = 1/kdkp1
        # update down the counter-diagonal of the table
        for j in 1:k-1
            cst = -(kp1-j)*kdkp1jm2/kp1
            N[k-j] = muladd(cst,N[k-j],N[k-j+1])
            D[k-j] = muladd(cst,D[k-j],D[k-j+1])
            kdkp1jm2 *= kdkp1
        end
        l[k] = N[1]/D[1]
    end
    return l
end

Of course I decided to just completely unroll everything and precompute the cst variable.

I just tried in a fresh environment pulling the most recent versions and didn't have much luck for any routines that weren't the fixed order variety (besselk0, besselk1, etc).

Ya, I filed a few issues a couple of weeks ago and Billy fixed them on main. They are working on a new rule system, I believe, so it's been a while since they have released a tagged version; you'll have to try Enzyme#main.

julia> @benchmark besselj(v, 1.2) setup=(v=rand())
BenchmarkTools.Trial: 10000 samples with 981 evaluations.
 Range (min … max):  62.096 ns … 102.404 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     67.406 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   65.869 ns ±   2.674 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▄▆▇▅                             ▅█▆▂                ▁▂  ▂
  ▅████████▄▄▃▁▁▁▄▄▄▃▃▁▄▁▅███▆▆▃▁▁▁▃▃▅█████▅▄▃▁▄▁▁▁▄▄▃▄▁▃▁▇███ █
  62.1 ns       Histogram: log(frequency) by time      70.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark autodiff(Forward, v -> besselj(v, 1.2), Duplicated, Duplicated(2.4, 1.0))
BenchmarkTools.Trial: 10000 samples with 952 evaluations.
 Range (min … max):  94.625 ns … 137.693 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.889 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   95.189 ns ±   1.758 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▇▄▁                             ▂▃▂                      ▂
  ▇██████▅▁▁▃▄▁▄▁▅▁▁▁▁▄▄▁▁▁▃▃▄▁▄▃▃▁▁▅▇████▄▄▁▃▃▃▃▁▁▄▃▁▄▄▅▆▆▆▆▆ █
  94.6 ns       Histogram: log(frequency) by time      99.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> 

Here is the code working on besselj with respect to order on the main branch. Honestly that is pretty fast all things considered.

@oscardssmith
Member

Yeah, my suggestion was the hybrid of this looped version with your unrolled version (where you precompute cst but otherwise leave it looped). I think that should vectorize well and be fast.

@heltonmc
Member Author

Ya, I think my initial attempt at this (removing that cst) steered me away from this approach, considering the n(n+1)/2 growth in operations for the sum. Just an example code..

function g(s)
    n = length(s) - 1
    l = similar(s)
    N = similar(l)
    D = similar(N)

    for k in 1:n
        w = s[k+1] - s[k]
        N[k], D[k] = s[k] / w, 1.0 / w
        for j in 1:k-1
            # cst replaced by a placeholder constant (-2.1) just to time the loop
            N[k-j] = muladd(-2.1,N[k-j],N[k-j+1])
            D[k-j] = muladd(-2.1,D[k-j],D[k-j+1])
        end
        l[k] = N[1]/D[1]
    end
    return l
end
julia> p = rand(16);

julia> @benchmark g($p)
BenchmarkTools.Trial: 10000 samples with 200 evaluations.
 Range (min … max):  406.665 ns … 2.451 μs  ┊ GC (min … max): 0.00% … 82.54%
 Time  (median):     416.250 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   423.480 ns ± 92.645 ns  ┊ GC (mean ± σ):  1.09% ±  4.06%

     ▁▃▅▆████▇▆▅▄▃▂▁▁▁▂▂▂▂▂       ▁▁▁▁                         ▂
  ▅▆█████████████████████████▇█▇▇███████▇▇▇▆▅▆▆▆▆▅▆▆▅▂▄▄▅▃▄▄▆▅ █
  407 ns        Histogram: log(frequency) by time       466 ns <

 Memory estimate: 576 bytes, allocs estimate: 3.

julia> p = rand(20);

julia> @benchmark g($p)
BenchmarkTools.Trial: 10000 samples with 171 evaluations.
 Range (min … max):  632.795 ns … 2.940 μs  ┊ GC (min … max): 0.00% … 77.83%
 Time  (median):     647.421 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   656.425 ns ± 106.955 ns  ┊ GC (mean ± σ):  0.79% ±  3.76%

       ▂▂▄▆▇█▇▇▆▄▃▁▁▁▁▁▃▃▃▂▁▁    ▁▁▁▂                           ▂
  ▂▄▅▆▇███████████████████████████████▇▇▇▇▅▆▆▆▆▆▅▆▅▅▄▅▄▄▂▅▄▃▅▄▄ █
  633 ns        Histogram: log(frequency) by time        713 ns <

 Memory estimate: 672 bytes, allocs estimate: 3.

julia> p = rand(24);

julia> @benchmark g($p)
BenchmarkTools.Trial: 10000 samples with 45 evaluations.
 Range (min … max):  908.333 ns … 10.032 μs  ┊ GC (min … max): 0.00% … 90.52%
 Time  (median):     934.244 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   945.136 ns ± 218.800 ns  ┊ GC (mean ± σ):  0.56% ±  2.21%

        ▁▄▄▅▆▇███▇▆▄▂▁    ▁▃▃▄▃▃▃▁▁                             ▂
  ▄▃▄▄▆▇██████████████▇▅▆▇██████████████▇█▇▇█▇█▇▇█▆▅▅▄▃▄▅▅▅▄▆▅▆ █
  908 ns        Histogram: log(frequency) by time       1.02 μs <

 Memory estimate: 768 bytes, allocs estimate: 3.

So the unrolled tuple version is roughly an order of magnitude faster than this looped version (and of course much more memory friendly)

@cgeoga
Contributor

cgeoga commented Mar 21, 2023

Would that vectorize better than a fully unrolled one? Looking at the crazy g1, it looks to me that you could write that function without a ton of repetition using Base.Cartesian.@nexprs for the declaration and the loops and stuff.

@heltonmc
Member Author

heltonmc commented Mar 21, 2023

Both of the versions vectorize in the sense that the numerator and denominator are computed simultaneously. There is an opportunity to vectorize further if we store two different arrays with offset indexes that allow computation over the whole vector, but if we fill the SIMD registers with just the numerator (say we reduce the total computations by 4x) we still have to compute that twice, once each for the numerator and denominator, since that part is not going to vectorize well.

Though I don't think it is worth going much further than SIMDing the numerator and denominator together (which is the natural choice), because no matter what we have a 16->15->14->... multiply structure; if we unroll by a factor of 4 that becomes a 4->4->4->4->3->... structure, with excess multiplies that won't quite fill the registers. So if the naive case is n*(n+1)/2 multiplies (136 for n=16), then a perfect SIMD version would have 4*4 + 4*3 + ... = 40 vector multiplies. The naive 136-multiply version still vectorizes the numerator and denominator together, whereas fully vectorizing a single evaluation gives 40*2 = 80, so the SIMD version has a best case of 80 multiplies compared to 136.

That's a lot, but it starts to become not worth it: we would have to store additional vectors and have them packed efficiently, and I doubt it will help much when all is said and done. I initially tried it, but then asked what happens when we move to the complex domain? It will make more sense to keep the extra SIMD registers to compute those complex additions better.

It looks to me that you could write that function without a ton of repetition using Base.Cartesian.@nexprs

I think you are right here. My initial attempt was not type stable, as the tuple length changed in each iteration, so we need to be aware of that.

@cgeoga
Contributor

cgeoga commented Mar 21, 2023

Wow, well honestly a lot of that is pretty over my head here. I have never really taken SIMD seriously. I'm trying to write the @nexprs equivalent of your g1 and I have one stumper:

function g1_gen(s)
  # declare the a and b scalar variables:
  Base.Cartesian.@nexprs 17 j->(b_j = inv(s[j+1] - s[j]))
  Base.Cartesian.@nexprs 17 j->(a_j = b_j*s[j])
  # first sequence of operations for index=1:
  CC1 = CC[1]
  Base.Cartesian.@nexprs 16 j->begin
    a_j = muladd(a_j, -CC1[j], a_{j+1})
    b_j = muladd(b_j, -CC1[j], b_{j+1})
  end
  # outer loop over "index". 
  # STUMPER: Is there some way to make sure that this gets
  # expanded before the "inner" @nexprs, so that {16-index_m1} expands to the
  # raw int before the inner call to @nexprs?
  Base.Cartesian.@nexprs 15 index_m1->begin
    index = index_m1+1
    # inner loop over the variable index:
    Base.Cartesian.@nexprs {16-index_m1} j->begin
      a_j = a_{j+1}*CC[index][j]
      b_j = b_{j+1}*CC[index][j]
    end
  end
  a_1/b_1
end

But if there is some trick to make that outer @nexprs expand first, then I think this will give what you want. I'm sure it will need to be workshopped plenty, but just playing around for fun. I tried putting a @macroexpand on the outer @nexprs, but that didn't change anything.

@heltonmc
Member Author

Thanks for the idea 😄 I think this can go in as an example of the what-the-heck / amazing side of Julia, but:

using Base.Cartesian: @nexprs

function g1(s)
    @nexprs 17 i -> b_{i} = 1 / (s[i + 1] - s[i])
    @nexprs 17 i -> a_{i} = s[i] * b_{i}
    @nexprs 16 index -> (@nexprs 17-index i -> (a_{i}, b_{i}) = (muladd(a_{i}, -CC[index][i], a_{i+1}), muladd(b_{i}, -CC[index][i], b_{i+1})))
    return a_1 / b_1
end

I believe that is correct (at least it returns the correct answer), and it is what you were looking for, in like three lines of code!

julia> @benchmark g1($(0.11274150564997776, 0.11475646021904119, 0.11370868384312821, 0.11451023277070166, 0.11369906525599734, 0.11472113632452476, 0.11317926911257481, 0.11588910073757686, 0.11045137194340607, 0.12271888810305538, 0.09198318397942494, 0.17666004884002684, -0.07776136204113551, 0.7501986007978472, -2.150973108989948, 8.739300196625988, -34.86022928456346, 150.58310277542904))
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  41.835 ns … 73.957 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.961 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.132 ns ±  1.291 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▆▃▂                                                       ▁
  █████▅▁▁▁▁▁▁▁▄▁▁▁▃▃▁▃▃▄▄▄▄▃▅▄▁▁▁▄▃▁▁▄▁▅███▅▅▇▄▁▁▁▁▃▄▁▅▃▁▃▃▄ █
  41.8 ns      Histogram: log(frequency) by time        46 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@heltonmc
Member Author

I think we could use a generated function as well, based on the NTuple length N.

@oscardssmith I'm curious how you feel about including code like that in the package, considering compile times / package load times / cache pressure? The generated LLVM code is immense, and how does that affect other code that may call besselk amongst other calls within a larger function? It is definitely very, very fast in these microbenchmarks...

@oscardssmith
Copy link
Member

In terms of speed, the generated function + @nexprs approach will be really good, and I don't think the cache pressure/load times will be much worse than the vectorized evalpolys (which also unroll). The one thing you could do that would help is generating the n^2 @nexprs as vectorized muladds in the first place (since doing so will reduce the number of instructions by 4x/8x).

@cgeoga
Contributor

cgeoga commented Mar 21, 2023

I can't believe it. I just had to stop trying to be clever with the {} or the $() and just write the raw difference! Ugh. I still feel like the curly brace definition should have worked based on the docs for @nexprs. But anyways, your version looks pretty slick! And I'm glad that the perf aspect is appealing.

@heltonmc
Member Author

heltonmc commented Mar 21, 2023

Haha, ya, one thing about these types of expressions is they aren't forgiving :) but this is really elegant. I like it a lot and it's very fast, so I think it would be good to move forward with this type of approach. I will work on Oscar's suggestion a little bit to reduce the number of instructions, but here is a fully general version of Levin's sequence transformation (which works for any N). And because we are unrolling it like this, we can put the constant factor within the function at no cost, which keeps it fully general.

using Base.Cartesian: @nexprs

@inline levin_scale(B, n, k) = -(B + n) * (B + n + k)^(k - 1.0) / (B + n + k + 1.0)^k

@generated function levin_transform(s::NTuple{N, Float64}) where N
    len = N-1
    len2 = N-2
    :(
        begin
            @nexprs $len i -> b_{i} = 1 / (s[i + 1] - s[i])
            @nexprs $len i -> a_{i} = s[i] * b_{i}
            @nexprs $len2 k -> (@nexprs ($len-k) i -> (a_{i}, b_{i}) = (muladd(a_{i}, levin_scale(1.0, i, k-1), a_{i+1}), muladd(b_{i}, levin_scale(1.0, i, k-1), b_{i+1})))
            return a_1 / b_1
        end
    )
end

One thing to note: we aren't quite done yet, because this is a sequence transform, so we still need to efficiently generate the sequence (cumsum) of our series definition. That shouldn't be too time consuming; I think the straightforward approach of going from the series definition to the sequence will work just fine (we will also get a better estimate of our weight function from the series definition).

@heltonmc
Member Author

Ok, here is an actual version that we can test and improve to compute besselk using Levin's sequence transformation. It computes the partial sums and weights on the fly and can be run for any number of terms.

This version now works with SIMDMath.jl to explicitly vectorize all these calculations. Oh, and I have a version that works for complex numbers!

using Base.Cartesian
using SIMDMath: VE, muladd, LVec, FloatTypes

@inline levin_scale(B::T, n, k) where T = -(B + n) * (B + n + k)^(k - one(T)) / (B + n + k + one(T))^k

# compute the scaled modified bessel function using the Levin-u transformation
# v = 0.2; x = 4.3
# besselkx(v, x, Val(19))
@generated function besselkx_levin(v, x::T, ::Val{N}) where {T <: FloatTypes, N}
    len = N-1
    len2 = N-2
    :(
        begin
            # generate a sequence of partial sums and weights of the asymptotic expansion for large argument besselk
            l = let (out, term, fv2) = (one(typeof(x)), one(typeof(x)), 4*v^2)
                @ntuple $N i -> begin
                    offset = muladd(2, i, -1)
                    term *= muladd(offset, -offset, fv2) / (8 * x * i)
                    out += term
                    # this check is relatively cheap but if the normal sum converges then we should return value
                    abs(term) <= eps(T) && return out * sqrt(π / 2x)
                    # need to return the weights and weighted partial sums
                    invterm = inv(term)
                    LVec{2, T}((out * invterm, invterm))
                end
            end
            # this is required because you can't set the index of a tuple
            @nexprs $len i -> a_{i} = l[i]
            @nexprs $len2 k -> (@nexprs ($len-k) i -> a_{i} = muladd(a_{i}, levin_scale(one(T), i, k-1), a_{i+1}))
            return (a_1[1].value / a_1[2].value) * sqrt(π / 2x)
        end
    )
end

@generated function besselkx_levin(v, x::Complex{T}, ::Val{N}) where {T <: FloatTypes, N}
    len = N-1
    len2 = N-2
    :(
        begin
            # generate a sequence of partial sums and weights of the asymptotic expansion for large argument besselk
            l = let (out, term, fv2) = (one(typeof(x)), one(typeof(x)), 4*v^2)
                @ntuple $N i -> begin
                    offset = muladd(2, i, -1)
                    term *= @fastmath muladd(offset, -offset, fv2) / (8 * x * i)
                    out += term
                    # this check is relatively cheap but if the normal sum converges then we should return value
                    (abs(real(term)) <= eps(T) && abs(imag(term)) <= eps(T)) && return out * sqrt(π / 2x)
                    # need to return the weights and weighted partial sums
                    invterm = @fastmath inv(term)
                    LVec{4, T}((reim(out * invterm)..., reim(invterm)...))
                end
            end
            # this is required because you can't set the index of a tuple
            @nexprs $len i -> a_{i} = l[i]
            @nexprs $len2 k -> (@nexprs ($len-k) i -> a_{i} = muladd(a_{i}, levin_scale(one(T), i, k-1), a_{i+1}))
            return (complex(a_1[1].value, a_1[2].value) / complex(a_1[3].value, a_1[4].value)) * sqrt(π / 2x)
        end
    )
end

besselk_levin(v, x::T, ::Val{N}) where {T <: Union{FloatTypes, Complex{<:FloatTypes}}, N} = exp(-x) * besselkx_levin(v, x, Val(N))

I've added some comments in there about what it is actually doing (though still somewhat terse), but I think this is ready to be workshopped a little bit. The real version will generate <2 x double> instructions while the complex version will generate <4 x double> instructions. The main challenge with the complex code is that the runtime is dominated by the divisions in the sequence generation, not in the transformation.

Some examples and benchmarks.

julia> v = 0.2
0.2

julia> x = 4.3
4.3

julia> besselkx_levin(v, x, Val(16))
0.5911957440242439

julia> using SpecialFunctions

julia> besselkx(v, x)
0.591195744024244

julia> using BenchmarkTools

julia> @benchmark besselkx_levin($v, $x, Val(16))
BenchmarkTools.Trial: 10000 samples with 987 evaluations.
 Range (min … max):  50.411 ns … 90.210 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.954 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.462 ns ±  4.162 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅ ▁                                                        ▁
  ██▁██▄▄▁▄▁▁▁▁▁▁▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▃█ █
  50.4 ns      Histogram: log(frequency) by time      86.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

And now for complex numbers!!!

julia> x = 4.3 + 1.1im
4.3 + 1.1im

julia> besselkx_levin(v, x, Val(16))
0.5783561826343409 - 0.06993526928128495im

julia> besselkx(v, x)
0.578356182634341 - 0.0699352692812849im

So you might think that the complex version would be much, much slower considering all the divisions and such, but because we are able to vectorize it pretty well it is only about 3x slower than the real version (I'm sure I can play around with the sequence generation to get it a bit faster).

julia> @benchmark besselkx_levin($v, $x, Val(16))
BenchmarkTools.Trial: 10000 samples with 803 evaluations.
 Range (min … max):  155.110 ns … 180.239 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     155.340 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   155.431 ns ±   0.667 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄▆███▆▄▁                                               ▁▁   ▂
  ▄█████████▃▁▃▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▇█████▇ █
  155 ns        Histogram: log(frequency) by time        158 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now that I'm looking at the code more, we can just combine the real and complex versions and use a branch on the type information, which will be eliminated at compile time for each version.
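A minimal illustration of that idea (hypothetical kernel, not Bessels.jl code): since x isa Complex is known from the input type, each specialization compiles to a single branch-free path.

# the `x isa Complex` check is resolved during specialization, so the dead
# branch is removed by the compiler for each concrete input type
kernel(x) = x isa Complex ? abs2(x) : x * x

kernel(3.0)          # compiles and runs only the real path
kernel(3.0 + 1.0im)  # compiles and runs only the complex path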

@cgeoga
Contributor

cgeoga commented Mar 29, 2023

Okay, not sure if this is the right place for it, but I've been playing with ForwardDiff AD on this routine and I have found one issue: half-integer orders. The reason is that the uniform and argument-based asymptotic expansions actually terminate after a finite number of terms when v + 1/2 is an integer. But of course that doesn't mean the derivative information terminates as well. So the issue is with

[...]
abs(term) <= eps(T) && return out * sqrt(π / 2x)
[...]

But if you just comment that out, then the inv(term) blows up and you end up getting a NaN out once all is said and done.

My next naive thought is to just stop updating the things stored in the tuple when the series has converged:

[...]
if abs(term) > eps(T)
  out += term
  invterm = inv(term)
end
[...]

And just by putting that little branch there now the tuple values will just stay constant after term is zero.

BUT, the issue that still comes up is with the actual Levin transform stuff. And this is the part that I don't understand super well, but each of those steps starts at the first index, so even though the series is converged, you still get different numbers if you compare besselkx_levin(3.5, 4.0, Val(16)) with besselkx_levin(3.5, 4.0, Val(20)).

I don't think this is impossible, but just an update because it will probably be relevant in other places too. I think we just need to find a way to let terms be zero in peace. I'll keep tinkering with this for now though, unless you think your rewrite of the Levin code will fix this for free.

@heltonmc
Member Author

Hmmmm... I actually don't think this should be an issue with the Levin transform itself but rather with the generation of the partial sums. For instance, if we look at the expansion in NIST 10.40.2

function _besselk_large_argument(v, x::T) where T
    MaxIter = 50 
    S = eltype(x)
    v, x = S(v), S(x) 
 
    fv2 = 4 * v^2 
    term = one(S) 
    res = term 
    s = term 
    for i in 1:MaxIter
        @show res
        offset = muladd(2, i, -1) 
        term *= muladd(offset, -offset, fv2) / (8 * x * i) 
        res = muladd(term, s, res) 
        #abs(term) <= eps(T) && break 
    end 
    return res 
end

These partial sums just generate ones. When we plug them into the weight function, the weights become zeros, and zero weights will generate NaNs really quickly. But what we need to have happen is to differentiate to the form given in 10.40.8 (which I showed in the other issue) before calculating the weights..... So to me, if ForwardDiff can work with the regular asymptotic expansion, it should also work here. The partial sums we generate are just not right. This seems to be an issue with how I've set the code up and not with the actual transform itself? Will look more at this.
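To see this concretely, here's a standalone check (plain Julia, not package internals) that for v = 1/2 the term recurrence vanishes after the first step, so every partial sum equals one and the forward differences that feed the Levin weights are exactly zero:

function partial_sums_half(x, n)
    v, term = 0.5, 1.0
    s = [term]                # partial sums of the 10.40.2 expansion
    for i in 1:n
        offset = 2i - 1
        term *= (4v^2 - offset^2) / (8x * i)   # exactly zero at i = 1 for v = 1/2
        push!(s, s[end] + term)
    end
    return s
end

s = partial_sums_half(10.0, 4)   # [1.0, 1.0, 1.0, 1.0, 1.0]
diff(s)                          # all zeros, so the weights 1 ./ diff(s) are Inf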

@cgeoga
Contributor

cgeoga commented Mar 29, 2023

For sure, I'm with you about that---I just thought there was something about the Levin transform that required keeping that second invterm. I'll also poke around a bit more with this and I'm sure we can sort it out.

EDIT: okay, wrote that too quickly. We do need that for the Levin transform one way or another, and I'm pretty sure early termination will not give correct derivatives. This was the reason we started using the exponential improvement for the half-integer case in BesselK.jl. In the "normal" uniform asymptotic expansion, the fix is easy---just don't terminate and do it for whatever fixed number of terms anyway, even if they are zero when v isa Float64. But considering that there is this division in the Levin transform I do think that there is something to think about here.

@heltonmc
Member Author

Could you also post code that produces wrong results? Like, does it produce zeros or just wrong results, and is this only for v=1/2? I just tried with Enzyme and it seemed reasonable?

 julia> autodiff(Forward, v-> besselkx_levin(v, 12.0, Val(16)), Duplicated, Duplicated(0.5, 1.0))
(0.3618006272791338, 0.015075026136630575)

julia> besselkx(0.5, 12.0)
0.3618006272791339

julia> autodiff(Forward, v-> besselkx_levin(v, 12.0, Val(16)), Duplicated, Duplicated(1.5, 1.0))
(0.39195067955239493, 0.04710945667697055)

julia> besselkx(1.5, 12.0)
0.3919506795523951

@cgeoga
Contributor

cgeoga commented Mar 30, 2023

I'll write more on this tomorrow, but unfortunately the answers that seem reasonable are often wrong for this edge case. Consider:

using FiniteDifferences, Bessels
const FDM = central_fdm(10,1)
function ref_dbesselkx_dv(v, x)
  f2 = Base.Fix2(Bessels.besselkx, x)
  FDM(f2, v)
end

ref_dbesselkx_dv(0.5, 12.0) # ~ 0.0145
ref_dbesselkx_dv(1.5, 12.0) # ~ 0.0470

So what you're seeing is not catastrophically inaccurate, but also only correct to a couple significant digits. Here is the version of your code that I am using to play with ForwardDiff (really just cosmetic modifications to fit my own formatting tastes):

@inline function levin_scale(B::T, n, k) where T 
  -(B + n) * (B + n + k)^(k - one(T)) / (B + n + k + one(T))^k
end

@generated function _besselkx_levin(v::V, x::T, ::Val{N}) where {V, T, N}
  len  = N-1
  len2 = N-2
  outT = promote_type(V, T)
  quote
    begin
      l = let (out, term, fv2, eightx) = (one($outT), one($outT), 4*v^2, 8*x)
        Base.Cartesian.@ntuple $N i -> begin
          offset = muladd(2, i, -1)
          term  *= muladd(offset, -offset, fv2) / (eightx*i)
          out   += term
          abs(term) <= 1e-15 && return out * sqrt(π / 2x)
          invterm = inv(term)
          (out*invterm, invterm)
        end
      end
      Base.Cartesian.@nexprs $len i -> a_{i} = l[i]
      Base.Cartesian.@nexprs $len2 k -> begin
        Base.Cartesian.@nexprs ($len-k) i -> begin
          sc  = levin_scale(one(T), i, k-1)
          a_i = (muladd(a_{i}[1], sc, a_{i+1}[1]), 
                 muladd(a_{i}[2], sc, a_{i+1}[2]))
        end
      end
      return $outT((a_1[1]/a_1[2]) * sqrt(π / 2x))
    end
  end
end

function our_dbesselkx_dv(v, x)
  f2 = _v -> _besselkx_levin(_v, x, Val(16))
  ForwardDiff.derivative(f2, v)
end

And again you see:

julia> ref_vs_ours(v, x) = (ref_dbesselkx_dv(v, x) - our_dbesselkx_dv(v, x))/abs(ref_dbesselkx_dv(v, x))
ref_vs_ours (generic function with 1 method)

julia> ref_vs_ours(1.5, 12.0)
-0.0020240634451386256

julia> ref_vs_ours(1.501, 12.0)
-2.9332885207854924e-13

The reason for this is the early return in the Levin code, and so if the series happens to terminate exactly (which occurs whenever v+1/2 is an integer), you hit that early return and you effectively differentiate just the asymptotic expansion itself, and with only ceil(v) many terms. And the reason THAT is wrong is because you could write $\mathcal{K}_{\nu}(x)$ as

$\mathcal{K}_{\nu}(x)=\sum_{j=1}^{\text{ceil}(\nu)}[...]+R_{\nu}(x)$

where $R$ is some remainder term that is zero when $\nu + 1/2$ is an integer. But $\partial_{\nu} R$ is certainly not zero even when $\nu + 1/2$ is an integer, and so this early return really cuts off an entire piece of the function.
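Spelled out, the derivative then reads

$\partial_{\nu}\mathcal{K}_{\nu}(x)=\sum_{j=1}^{\text{ceil}(\nu)}\partial_{\nu}[...]+\partial_{\nu}R_{\nu}(x),$

and it is exactly that trailing $\partial_{\nu}R_{\nu}(x)$ piece that the early return throws away.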

Anyways, this comment is getting long, sorry. I do think this can be addressed, but it will probably in the end involve finding a way to accumulate coefficients even when the not-derivative part is 0, if you know what I mean.


@heltonmc
Member Author

So you are saying we should just use finite differences 😄 This is an interesting problem (and at least we seem to be tackling the toughest case first). I just want to take one step back to make sure I see the problem. If we look at the regular asymptotic expansion

function besselk_large_argument(v, x::T) where T
    a = exp(-x / 2)
    coef = a * sqrt(pi / 2x)
    return T(_besselk_large_argument(v, x) * coef * a)
end
function _besselk_large_argument(v, x::T) where T
    MaxIter = 50 
    S = eltype(x)
    v, x = S(v), S(x) 
 
    fv2 = 4 * v^2 
    term = one(S) 
    res = term 
    s = term 
    for i in 1:MaxIter
        offset = muladd(2, i, -1) 
        term *= muladd(offset, -offset, fv2) / (8 * x * i) 
        res = muladd(term, s, res) 
        #abs(term) <= eps(T) && break 
    end 
    return res 
end

And if I modify the code to just run for 50 terms no matter what, we can get good derivatives? At least that seems to match ok.

julia> autodiff(Forward, v->besselk_large_argument(v, 55.0), Duplicated, Duplicated(0.5, 1.0))[2]
1.97876284377238e-27

julia> dvbesselk(ArbFloat(0.5), ArbFloat(55.0))
1.97876284377238030106336346605515496107532396354231328952625258552628280879127918412731333961880277672326758481184840407518354008994240482915254186001225691837201407244641556100665654134700311440503454914579578322988897030296403705253013030240425314170276324542186929853440167150195734242472842029798051876839218023694971818866179030774694381722849809203226358554462257075547987853989971279198759455911150377220806768520043019343563369383924297628343713389132011187097644750889728058214066854106251841856730816010027578080074991956856259721740611201807109150298939419064511338703147526835965006e-27

So that is right to every digit. It does seem like the derivative converges more slowly though, so we need to be cautious with that termination criterion?

So the next question then is how does that translate to the sequence transform.... I'd be interested to see if we can separate the derivative partial sums from the actual transform itself, so we can feed in the right partial sums and see if that helps converge the sum. One thing to keep in mind is that the sequence transforms only work well for divergent alternating series. It looks like the derivative series will satisfy that but might have slightly different convergence characteristics?

@cgeoga
Contributor

cgeoga commented Mar 30, 2023

Ha, I would definitely not suggest just doing finite differences. But yeah, what you're describing about just having a bunch of terms to compensate is exactly what we do in BesselK.jl. The issue is just that for the real intermediate cases like (v,x) = (3.5, 5.0), you will just need an absolute ton of terms. I think the series transform things are super interesting there, and I don't quite understand the Miller method enough to comment on any of it. But I think this strategy is super appealing and fixable. In fact, if we could have a sequence transform that didn't involve a term in S_{n+1} - S_n in the denominator that will be zero when the series converges in finitely many steps (that are only knowable at runtime), I think we'd be good to go. We just need to think about it for a little while I think. I'm confident that there will be a way to deal with this and keep the good tricks that give us such better accuracy per iteration in the intermediate ranges.

@heltonmc
Member Author

heltonmc commented Mar 30, 2023

Love this problem 😄! Super interesting. The good news though is the sequence transform could drastically reduce the number of terms we need so I think this is still very promising especially if the derivative takes even more terms.

With that said I think this might be just an issue with how I originally coded the first problem If we take this all the way back (I'm just going to put the general code here).

function g(s)
    # General Levin-type sequence transform of a vector of partial sums `s`:
    # N and D carry the numerator and denominator recursions, and l[k] is the
    # accelerated estimate after k terms. (Note only l[1:end-1] is written;
    # the last slot stays uninitialized.)
    n = length(s) - 1
    l = similar(s)
    N = similar(l)
    D = similar(N)

    for k in 1:n
        w = s[k+1] - s[k]
        N[k], D[k] = s[k] / w, 1.0 / w
        kp1 = k + 1.0
        kdkp1 = k / kp1
        kdkp1jm2 = 1 / kdkp1
        for j in 1:k-1
            cst = -(kp1 - j) * kdkp1jm2 / kp1
            N[k-j] = muladd(cst, N[k-j], N[k-j+1])
            D[k-j] = muladd(cst, D[k-j], D[k-j+1])
            kdkp1jm2 *= kdkp1
        end
        l[k] = N[1] / D[1]
    end
    return l
end

function _besselk_large_argument(v, x::T, iter) where T
    # Partial sum of the large-argument expansion (NIST 10.40.2) truncated
    # after `iter` terms; no early termination, so AD sees every term.
    S = eltype(x)
    v, x = S(v), S(x)

    fv2 = 4 * v^2
    term = one(S)
    res = term
    s = term
    for i in 1:iter
        offset = muladd(2, i, -1)
        term *= muladd(offset, -offset, fv2) / (8 * x * i)
        res = muladd(term, s, res)
        #abs(term) <= eps(T) && break
    end
    return res * sqrt(pi / 2x)
end

Now what we need to realize is that the sequence transform can handle any sequence, but we don't necessarily need to differentiate through the sequence transform itself; we only need to take the derivative of the sequence and then feed that derivative sequence into the transform. I'm hoping that makes sense.

So essentially this is all we need to do...

v = 0.5; x = 5.0
# generate our partial sums but using the derivative
out = [autodiff(Forward, v->_besselk_large_argument(v, x, iter), Duplicated, Duplicated(v, 1.0))[2] for iter in 1:30];

# do the sequence transform on those derivative partial sums
julia> g(out)
30-element Vector{Float64}:
 0.05604991216397929
 0.051379086150314356
 0.05131596687985941
 0.0513217030639802
 0.051321108923874925
 0.051321175118541326
 0.05132116743538506
 0.051321168343865145
 0.051321168236451525
 0.05132116824892151
 0.051321168247529726
 0.051321168247674714
 0.051321168247661336
 0.051321168247662294
 0.05132116824766225
 0.05132116824766225
 0.05132116824766225
 0.05132116824766225
 0.051321168247662266
 0.05132116824766226
 0.05132116824766225
 0.05132116824766226
 0.05132116824766225
 0.05132116824766226
 0.05132116824766225
 0.051321168247662266
 0.05132116824766226
 0.05132116824766225
 0.051321168247662266

julia> dvbesselk(ArbFloat(v), ArbFloat(x))[2] * exp(ArbFloat(x))
0.05132116824766225501961245911721947725432532303036730970326023167364314434751562772143246443214094748722

So as you can see this only needs 14 terms to give us a derivative accurate to every digit!

Edit. And only about 18-20 terms for x=2.5

julia> v = 0.5; x = 2.5
julia> out = [autodiff(Forward, v->_besselk_large_argument(v, x, iter), Duplicated, Duplicated(v, 1.0))[2] for iter in 1:30];

julia> dvbesselk(ArbFloat(v), ArbFloat(x))[2] * exp(ArbFloat(x))
0.135087772677340579105023651731665921186421969870307985751490770136161243722610767275174444024496555435028836619099605088362489867464669721497571410861968552056029631007269326971586330043379328447709022231407430109494818834510767049995833607446153636099383460123617983835365629621156133602351560255916340784931099706116063704942997261114836463603880844812376501562359483149734726825274119823214982811149094563989864455340693045374377573412862809968997866980601751064617778251759629803335674838979665361128354655316750064362682923399534613558675702730058772076161233575816259345806490530088060237

julia> g(out)[1:20]
20-element Vector{Float64}:
 0.15853309190424045
 0.1358855073464918
 0.13498857330460076
 0.13510091101392382
 0.1350860461979384
 0.135087985897801
 0.1350877496735732
 0.13508777449654483
 0.13508777266063987
 0.13508777264824284
 0.1350877726838682
 0.135087772676591
 0.1350877726773422
 0.13508777267736022
 0.13508777267733627
 0.1350877726773409
 0.13508777267734065
 0.13508777267734057
 0.1350877726773406
 0.13508777267734062

@cgeoga
Contributor

cgeoga commented Mar 30, 2023

Wow, okay, very nice! So we really just need to somehow get those derivatives evaluated before the acceleration computes inv(w). That is super doable.

@heltonmc
Member Author

Yes - I think there are two things about this problem.

  1. Separate the code for the partial sums from the sequence transformation. Use autodiff on the partial sums, then feed those into the sequence transform. The Levin function can then just be a general function for any sequence transformation that accepts partial sums sn and corresponding weights wn.

  2. How do we have a stopping criterion for: a. how many terms to use in the sequence of partial sums, or b. how many terms to use in the iterative normal asymptotic expansion, such that it converges for both the scalar evaluation and the derivative.

To me the second point has to be some general problem with AD itself for iterative processes. I'm wondering if this is already solved? I'm not as familiar with AD honestly, but the problem could be stated generally as: for an iterative sum $f_i(x)$ with $i$ terms, to what extent does $f_i(x)$ converging to $f(x)$ (which we detect with our stopping criterion during the recursive sum) imply that $f_i'(x)$ converges to $f'(x)$? Ideally, AD would be able to check for both conditions.... but it seems like it just terminates when the first condition is true, which generates awful estimates of the second condition.

This issue has to arise frequently? I would mostly expect the sum of derivatives to converge more slowly (at least require a couple more higher-order terms). Is it possible to have two separate stopping criteria when AD is used?
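To make point 2 concrete, one possible shape for a dual stopping criterion (just a sketch; isconverged is a hypothetical helper, not existing Bessels.jl code): when the element type carries derivative information, like ForwardDiff's Dual, a term should only count as converged once its value and all of its partials are below tolerance.

using ForwardDiff: Dual, value, partials

# a term is "converged" only if its value and every partial derivative are small
isconverged(t::Dual, tol) = abs(value(t)) <= tol && all(p -> abs(p) <= tol, partials(t))
isconverged(t::Real, tol) = abs(t) <= tol

d = Dual(0.0, 1.0)        # value has converged, derivative has not
isconverged(d, 1e-15)     # false -> keep iterating
isconverged(0.0, 1e-15)   # true for a plain Float64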

@oscardssmith
Member

Can the Implicit Function theorem be used here?

@cgeoga
Contributor

cgeoga commented Apr 26, 2023

Okay, so after looking at the Enzyme docs for a while, the idea of compartmentalizing the Levin transformation from the series accumulation turns out to have a reasonably elegant solution. If I go back to an earlier version of the code that doesn't also collect invterm, I can do the Levin transform directly on the derivative of the series values like so:

using Bessels, Enzyme
using Enzyme.EnzymeRules
using Base.Cartesian: @nexprs, @ntuple

@inline levin_scale(B::T, n, k) where T = -(B + n + k) * (B + n + k - 1) / ((B + n + 2k) * (B + n + 2k - 1))

@generated function levin_transform(s::NTuple{N, Float64}) where N
    len = N-1
    len2 = N-2
    :(
        begin
            @nexprs $len i -> b_{i} = 1 / (s[i + 1] - s[i])
            @nexprs $len i -> a_{i} = s[i] * b_{i}
            @nexprs $len2 k -> (@nexprs ($len-k) i -> (a_{i}, b_{i}) = (muladd(a_{i}, levin_scale(1.0, i, k-1), a_{i+1}), muladd(b_{i}, levin_scale(1.0, i, k-1), b_{i+1})))
            return a_1 / b_1
        end
    )
end

# Here is the sauce: a manual rule that separately does the levin transform for
# the values and dvalues. This is because if v+1/2 is an integer, the
# sequences of _values_ will be exactly converged and you'll have a divide by
# zero NaN problem, but the _dvalues_ will NOT do that, and you want all those
# terms for the sake of accuracy.
function EnzymeRules.forward(func::Const{typeof(levin_transform)}, 
                             ::Type{<:Duplicated}, 
                             s::Duplicated)
  sv  = s.val
  dv  = s.dval
  N   = length(sv)
  ls  = (sv[N-1] == sv[N]) ? sv[N] : levin_transform(sv)
  dls = (dv[N-1] == dv[N]) ? dv[N] : levin_transform(s.dval)
  Duplicated(ls, dls)
end

@generated function besselkx_as(v, x::T, ::Val{N}) where {T, N}
  quote
    # generate a sequence of partial sums of the asymptotic expansion for
    # large-argument besselk (no weights here; the Levin transform builds
    # its own from the successive differences)
    tup = let (out, term, fv2) = (one(typeof(x)), one(typeof(x)), 4*v^2)
      @ntuple $N i -> begin
        offset  = muladd(2, i, -1)
        term   *= muladd(offset, -offset, fv2) / (8 * x * i)
        out    += term
        out
      end
    end
  end
end

@generated function besselkx_levin(v, x::T, ::Val{N}) where {T, N}
  :(levin_transform(besselkx_as(v, x, Val($N)))*sqrt(pi/2x))
end

# This allocates, so there may still be something to chase down.
dbeskx_dv(v, x) = autodiff(Forward, _v->besselkx_levin(_v, x, Val(20)), 
                           Duplicated, Duplicated(v, 1.0))[2]

#= 
# A reference so that you can check that the half-integer derivatives work.
#
using FiniteDifferences
dbeskx_dv_ref(v, x) = central_fdm(10,1)(_v->besselkx(_v, x), v)
=#

And you can check that the derivatives agree with the finite diff incredibly well even for v=1.5, v=2.5, and so on. This seems like a complete win to me.

So, @heltonmc: at some point, can you elaborate to me on why you have the invterm stuff in the current besselkx_levin? We do get the right answers without it, so does it somehow significantly speed up the actual Levin transformation code?

@heltonmc
Member Author

heltonmc commented Apr 27, 2023

This is awesome. Should we maybe split the particular Enzyme stuff into a separate issue to keep track of?

Just to keep everyone in the loop: I think we are going to have to go the route of custom rules for some functions. That might be the only way to get great performance and accuracy. The plan was to perhaps have this as a separate module that lives within Bessels and use a weak dep. and package extension, or perhaps move it into its own separate package like BesselsAD that focuses on being compatible with a larger AD framework.

@cgeoga I just pushed recent changes to the besselk_levin code over at #95. I'll paste the new functions below.

@generated function besselkx_levin(v, x::T, ::Val{N}) where {T <: FloatTypes, N}
    :(
        begin
            s = zero(T)
            t = one(T)
            l = @ntuple $N i -> begin
                    s += t
                    t *= (4*v^2 - (2i - 1)^2) / (8 * x * i)
                    invterm = inv(t)
                    Vec{2, T}((s * invterm, invterm))
                end
            return levin_transform(l) * sqrt(π / 2x)
        end
    )
end

@generated function besselkx_levin(v, x::Complex{T}, ::Val{N}) where {T <: FloatTypes, N}
    :(
        begin
            s = zero(typeof(x))
            t = one(typeof(x))
            t2 = t
            a = @fastmath inv(8*x)
            a2 = 8*x

            l =  @ntuple $N i -> begin
                    s += t
                    b = (4*v^2 - (2i - 1)^2) / i
                    t *= a * b
                    t2 *= a2 / b
                    Vec{4, T}((reim(s * t2)..., reim(t2)...))
                end
            return levin_transform(l) * sqrt(π / 2x)
        end
    )
end

These functions take advantage of constant propagation. They were a little faster, but I was mostly just trying to improve the code clarity a bit. I also refactored the complex code. The issue with the complex function is that you need to desperately avoid any complex divisions, so the different accumulators are there to make sure we only perform real divisions, or at worst a complex divided by a real, and never a real divided by a complex or a complex divided by a complex. Even doing like ten extra multiplies is worth this cost. If you could try the new code on that other branch, that would be great!
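Here is a stripped-down illustration of that accumulator trick (a standalone toy, not the package code): t2 is updated with the reciprocal factors so that t2 == 1/t holds throughout, using only a complex-by-real division per iteration, with the single true complex division hoisted out of the loop.

function reciprocal_accumulators(v, x::Complex, n)
    t, t2 = one(x), one(x)
    a, a2 = inv(8x), 8x                # the only complex division, done once
    for i in 1:n
        b = (4v^2 - (2i - 1)^2) / i    # purely real factor
        t  *= a * b                    # forward term of the expansion
        t2 *= a2 / b                   # complex / real only; keeps t2 == 1/t
    end
    return t, t2
end

t, t2 = reciprocal_accumulators(0.2, 4.3 + 1.1im, 5)
t * t2   # ≈ 1 + 0im, confirming t2 tracks 1/t without complex-complex divisions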

@cgeoga
Contributor

cgeoga commented Apr 27, 2023

Very nice. Now that I understand how to do the levin_transform trickery I think I can write rules that work with those output Vec{N,T} objects and stuff. Ideally, I can just write a few of those manual forward rules for EnzymeRules.forward(::Const{typeof(levin_transform)}, [...] and we won't have to touch anything else. I don't expect any issues that can't be tinkered away.

Also, good point about a separate issue. I'll start a new one and use that method as the top post on it when it is done.

@heltonmc
Member Author

Perfect!! Ya, I think above in this thread I showed that you really just have to take the derivative of the sequences and feed them into the sequence transformation, and you should get good derivatives. I would actually imagine that you would then be able to get derivatives in the same time as a scalar evaluation.

One thing I just want to note here is why I didn't use the ComplexVec type in SIMDMath (considering I spent all that time developing it :) ) for the complex implementation.

Only two complex numbers can utilize SIMD at a time, so using the alternating storage approach allows the operations to fit in a 4 x double SIMD operation. This, coupled with only needing complex-real multiplies and complex-complex additions within the Levin loop, means that this is a really special case where the alternating storage approach is superior. I should make a note of that in the files somewhere.
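A small schematic of that interleaved layout (plain tuples standing in for SIMDMath's Vec; the helper names are illustrative only): two complex numbers are stored as (re1, im1, re2, im2), so a complex-real multiply is four independent real multiplies and a complex-complex add is four independent real adds, both of which map onto a single 4 x double SIMD instruction.

# pack two complex numbers into one interleaved 4-lane tuple
pack(z1::ComplexF64, z2::ComplexF64) = (reim(z1)..., reim(z2)...)

# complex * real on the packed layout: 4 independent multiplies
mul_real(p::NTuple{4, Float64}, c::Float64) = map(x -> x * c, p)

# complex + complex on the packed layout: 4 independent adds
add(p::NTuple{4, Float64}, q::NTuple{4, Float64}) = map(+, p, q)

p = pack(1.0 + 2.0im, 3.0 + 4.0im)
mul_real(p, 2.0)   # (2.0, 4.0, 6.0, 8.0)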
