auto wrapper #130

simonbyrne · 2023-12-30T00:25:28Z

Avoids need for @gsl_function macro as suggested in #128

simonbyrne · 2023-12-31T04:58:04Z

@lxvm can you try this out? The following seems to work for me:

using GSL

fquad = x -> x^1.5
ws_size = 100
ws = integration_cquad_workspace_alloc(ws_size)

a = 0
b = 1
result = Cdouble[0]
abserr = Cdouble[0]
nevals = Csize_t[0]
integration_cquad(fquad, a, b, 1e-10, 1e-10, ws, result, abserr, nevals)
result[1]

lxvm · 2023-12-31T16:10:30Z

Thanks @simonbyrne, indeed this approach works well and removes the need for any wrapping with gsl_function. I also did a timing comparing my original approach in Integrals.jl to one relying on the pr, i.e. passing the function directly to integration_cquad without wrapping as a Base.CFunction, which appears to be ~2x slower.

branch	btime
`c869ad9`	2.076 μs (10 allocations: 320 bytes)
`a93f752`	2.547 μs (10 allocations: 320 bytes)
lxvm@`ca26f62`	2.548 μs (12 allocations: 368 bytes)
`28d06e3`	2.805 μs (15 allocations: 496 bytes)
pr	6.354 μs (77 allocations: 1.50 KiB)
master	2.503 μs (12 allocations: 528 bytes)

using GSL, Integrals, BenchmarkTools
prob = IntegralProblem((x,p) -> sin(p*x), 0.0, 1.0, 1.0)
alg = GSLIntegration(integration_cquad; wssize=100)
@btime solve($prob, $alg)

Just by looking at flame graphs, it seems that Base.CFunction is type-grounded whereas gsl_function_helper can't be. Presumably this is unavoidable and related to the portability issues of CFunction?

yuyichao · 2023-12-31T16:47:13Z

You should be able to specialize gsl_function_helper.

lxvm · 2023-12-31T17:30:33Z

Thanks @yuyichao ! That tip helps recover most of the performance. I made a wrapper FixP that I pass to integrate_cquad and specialize it as follows

struct FixP{F<:IntegralFunction,P}
    f::F
    p::P
end

function GSL.gsl_function_helper(x::Cdouble, @specialize(f::FixP))::Cdouble
    try
        Cdouble(isinplace(f) ? (f.f(f.integrand_prototype, x, f.p); only(f.integrand_prototype)) : f.f(x, f.p))
    catch
        NaN
    end
end

which gives a better benchmark within 20% of the original

cache = init(prob, alg)
@btime solve!($cache)
  2.753 μs (68 allocations: 1.08 KiB)

and a flamegraph that still shows a lot of dispatch

In comparison to QuadGK.jl, which is probably doing a similar computation, if the dispatch could be removed then the timings would be comparable

alg = QuadGKJL()
cache = init(prob, alg)
@btime solve!($cache)
  137.133 ns (0 allocations: 0 bytes)

simonbyrne · 2023-12-31T20:07:15Z

@lxvm Please feel free to make another PR!

simonbyrne · 2023-12-31T22:01:47Z

You should be able to specialize gsl_function_helper.

How would that help when we call fptr = @cfunction(gsl_function_helper, Cdouble, (Cdouble, Any)), which is the non-specialized version?

yuyichao · 2024-01-01T00:36:51Z

See 28d06e3

The point is that you shouldn't call cfunction with Any as the parameter but should allow it to specialize instead, i.e. using Ref as the parameter type.

Fundamentally the rest isn't very different from passing the function as an arbitrary pointer to a c function but some care is needed to make sure all the allocations are properly rooted. The version implemented above should achieve no allocation and no dynamic dispatch though I haven't really checked carefully and probably won't have time to do so in a few days.

lxvm · 2024-01-03T17:43:41Z

@yuyichao That's really clever. Is it correct to say that the same pointer to data is interpreted as a Ref{T} by Julia but as void * to C?

Your branch seems close to optimal and I've updated the timing table to include it and here is also a timing of just the library call

cache = init(prob, alg)
@btime solve!($cache)
  2.130 μs (5 allocations: 176 bytes)

Of the remaining allocations, 3 of them can be elided by specializing function integration_cquad(f::F, a, b, epsabs, epsrel, ws, result, abserr, nevals) where {F} ..., as in lxvm@ca26f62, and the two that remain are the allocation of the mutablegsl_function and RefValue, so that gives

@btime solve!($cache)
  1.991 μs (2 allocations: 48 bytes)

Presumably the gsl_function and RefValue could be pre-allocated and their pointers updated as necessary?

simonbyrne · 2024-01-03T17:49:12Z

@yuyichao That's interesting: how does it actually work, since we define gsl_function_helper(x::Cdouble, fn::F)::Cdouble, but construct the @cfunction with a Ref{F} as the 2nd argument instead? Why isn't this a method error?

yuyichao · 2024-01-03T17:56:41Z

I have a updated branch almost ready with more testing.

yuyichao · 2024-01-03T18:36:02Z

See the updated branch at https://github.com/JuliaMath/GSL.jl/tree/yyc/gsl_func

Either the previous or the new version should be able to achieve optimal performance (i.e. no dynamic dispatch and no allocation) but the compiler's allocation optimization seems to have some corner case/limitation and wasn't able to optimize too complex a structure out. That's why the new version simplified this a little bit making use of the fact that gsl_function is mutable (the old version works for both mutable and immutable gsl_function and should be more or less equivalent to the new version when gsl_function is immutable)

The only case where the allocation isn't completely eliminated is when the callable type is immutable but contains references to mutable fields (immutable struct that are not pointer-free).

And yes the wrapper function needs to specialize on the function parameter, this is to trigger the specialization logic for <:Function types. A custom callable type without being <:Function does not have this limitation. I've added it to all the functions that uses gsl_function.

Presumably the gsl_function and RefValue could be pre-allocated and their pointers updated as necessary?

No absolutely not. That's is very bad/unsafe.

how does it actually work, since we define gsl_function_helper(x::Cdouble, fn::F)::Cdouble, but construct the @cfunction with a Ref{F} as the 2nd argument instead? Why isn't this a method error?

This is somewhat of a special case for cfunction that makes it symmetric with ccall. A ccall with Ref{T} as a parameter type accepts an argument of type T and will pass it in as a pointer to that object. cfunction with Ref as the argument type essentially a reverse of it. The two also support symmetric optimization: ccall of Ref{T} with an bitstype T will allocate it on stack and the cfunction with the same argument type will not assume the pointer points to an heap allocated object either.

lxvm · 2024-01-03T19:35:01Z

No absolutely not. That's is very bad/unsafe.

Thanks, I would have wanted to avoid that.

For completeness I updated the table above and can confirm that the allocations are elided,

@btime solve!($cache)
  2.084 μs (0 allocations: 0 bytes)

All of this is great to know, and I'm wondering if it would be appropriate to add to the Julia manual. I'd be happy to start a pr for it based on your comment.

And presumably the performance difference between integration_cquad and quadgk is that the former has to lookup pointers to functions and data whereas the latter can inline them?

yuyichao · 2024-01-03T20:38:11Z

I have no idea of the specific algorithms or the underlying implementations. If the actual algorithm as well as the implementation/optimization level is exactly the same then the difference would indeed only be on the C-julia interface.

How is the new performance compare to the old version when you pass in your own gsl_function? Or even with c function fwiw. How many times did your callback function get called? Does that take a significant amount of time? What if you remove the try-catch in the gsl_function_helper?

simonbyrne · 2024-01-03T20:42:45Z

@yuyichao thanks, that was helpful.

Regarding this comment:
master...yyc/gsl_func#diff-ed5863a4bdd6189a79314f8c6cb0d3acefa0fe9235e7df221e6686fedb9efc08R41-R44

If gsl_function was an immutable type, would the main change be that we would instead call pointer_from_objref on a Ref{gsl_function} instead of gsl_function itself?

yuyichao · 2024-01-03T21:23:19Z

Effectively yes, but the more canonical way is that the creation of Ref{gsl_function} would be done via Base.cconvert(Ref{gsl_function}, ...) and the pointer_from_objref would be done with Base.unsafe_convert(Ref{gsl_function}, ...). This works for mutable type too, but create a RefValue{gsl_function} that needs to be optimized out, which the compiler is able to do in some cases but somehow not for others.

lxvm · 2024-01-03T22:05:15Z

I was also wondering why here param_ref is returned with the gsl_function? Does that prevent the param_ref from being GC'd early, even though it is ignored in unsafe_convert?

yuyichao · 2024-01-03T22:08:35Z

Yes.

lxvm · 2024-01-03T23:52:35Z

Indeed, removing the try/catch block returns the fastest version yet. Here are plain benchmarks that should be copy-paste-able. At the end of the day a quadrature scheme's efficiency is measured by the number of function evaluations and there is slight variation in the number of function calls made by each algorithm, but nothing major. So the timings suggest a ~6x algorithm+language overhead for integration_cquad compared to quadgk, which of course becomes less important if the integrand is expensive. I'm glad that this discussion has been fruitful for both the compatibility with runtime callbacks and the performance and I'll continue finishing the wrappers in #131 .

Original test (master)

using QuadGK, GSL, BenchmarkTools, Test

atol = 1e-10
rtol = 1e-10
for p in [1.0, 10.0, 100.0]
    f = x -> sin(x*p)
    a = 0
    b = 1

    ws_size = 1000
    ws = integration_cquad_workspace_alloc(ws_size)
    n1, t1 = try
        result = Cdouble[0]
        abserr = Cdouble[0]
        nevals = Csize_t[0]

        ptr = @cfunction($((x,p) -> f(x)), Cdouble, (Cdouble, Ptr{Cvoid}))
        GC.@preserve ptr begin
            gslf = gsl_function(Base.unsafe_convert(Ptr{Cvoid},ptr), C_NULL)
            integration_cquad(gslf, a, b, atol, rtol, ws, result, abserr, nevals)
            @test result[1] ≈ (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol

            n = Int(nevals[1])
            n, @belapsed integration_cquad($gslf, $a, $b, $atol, $rtol, $ws, $result, $abserr, $nevals)
        end
    catch e
        display(e)
    finally
        integration_cquad_workspace_free(ws)
    end

    segbuf = QuadGK.alloc_segbuf()
    I, E, n2 = quadgk_count(f, a, b; atol, rtol, segbuf)
    @test I ≈ (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
    t2 = @belapsed quadgk($f, $a, $b; atol=$atol, rtol=$rtol, segbuf=$segbuf)

    @info "Quadrature benchmark" difficulty=p cquad_evals=n1 cquad_time=t1 quadgk_evals=n2 quadgk_time=t2 ratio_evals=n1/n2 ratio_time=t1/t2
end

Original results

┌ Info: Quadrature benchmark
│   difficulty = 1.0
│   cquad_evals = 33
│   cquad_time = 2.6535555555555555e-6
│   quadgk_evals = 15
│   quadgk_time = 1.2850338600451468e-7
│   ratio_evals = 2.2
└   ratio_time = 20.649693662253608
┌ Info: Quadrature benchmark
│   difficulty = 10.0
│   cquad_evals = 95
│   cquad_time = 9.177e-6
│   quadgk_evals = 105
│   quadgk_time = 1.1612e-6
│   ratio_evals = 0.9047619047619048
└   ratio_time = 7.9030313468825355
┌ Info: Quadrature benchmark
│   difficulty = 100.0
│   cquad_evals = 1235
│   cquad_time = 0.000111487
│   quadgk_evals = 855
│   quadgk_time = 1.1026e-5
│   ratio_evals = 1.4444444444444444
└   ratio_time = 10.11128242336296

New test (lxvm/gsl_func2)

using QuadGK, GSL, BenchmarkTools, Test

atol = 1e-10
rtol = 1e-10
for p in [1.0, 10.0, 100.0]
    f = x -> sin(x*p)
    a = 0
    b = 1

    ws_size = 1000
    ws = integration_cquad_workspace_alloc(ws_size)
    n1, t1 = try
        result = Cdouble[0]
        abserr = Cdouble[0]
        nevals = Csize_t[0]
        # gslf = @gsl_function(f)
        integration_cquad(f, a, b, atol, rtol, ws, result, abserr, nevals)
        @test result[1] ≈ (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
        n = Int(nevals[1])
        n, @belapsed integration_cquad($f, $a, $b, $atol, $rtol, $ws, $result, $abserr, $nevals)
    catch e
        display(e)
    finally
        integration_cquad_workspace_free(ws)
    end

    segbuf = QuadGK.alloc_segbuf()
    I, E, n2 = quadgk_count(f, a, b; atol, rtol, segbuf)
    @test I ≈ (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
    t2 = @belapsed quadgk($f, $a, $b; atol=$atol, rtol=$rtol, segbuf=$segbuf)

    @info "Quadrature benchmark" difficulty=p cquad_evals=n1 cquad_time=t1 quadgk_evals=n2 quadgk_time=t2 ratio_evals=n1/n2 ratio_time=t1/t2
end

New result

┌ Info: Quadrature benchmark
│   difficulty = 1.0
│   cquad_evals = 33
│   cquad_time = 1.8343e-6
│   quadgk_evals = 15
│   quadgk_time = 1.2889164785553046e-7
│   ratio_evals = 2.2
└   ratio_time = 14.23133329830645
┌ Info: Quadrature benchmark
│   difficulty = 10.0
│   cquad_evals = 95
│   cquad_time = 6.4998e-6
│   quadgk_evals = 105
│   quadgk_time = 1.1675e-6
│   ratio_evals = 0.9047619047619048
└   ratio_time = 5.56728051391863
┌ Info: Quadrature benchmark
│   difficulty = 100.0
│   cquad_evals = 1235
│   cquad_time = 7.5618e-5
│   quadgk_evals = 855
│   quadgk_time = 1.1579e-5
│   ratio_evals = 1.4444444444444444
└   ratio_time = 6.530615769928318

auto wrapper

f8a2a54

simonbyrne force-pushed the sb/helper branch from d42663b to f8a2a54 Compare December 31, 2023 04:55

remove some gsl_function macros

794c825

lxvm mentioned this pull request Dec 31, 2023

Wrap GSL.jl in extension SciML/Integrals.jl#213

Draft

7 tasks

lxvm mentioned this pull request Jan 3, 2024

auto wrapper take 2 #131

Closed

7 tasks

simonbyrne closed this Feb 2, 2024

simonbyrne deleted the sb/helper branch February 2, 2024 04:49

simonbyrne mentioned this pull request Feb 2, 2024

Avoid closures for custom operators JuliaParallel/MPI.jl#404

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

auto wrapper #130

auto wrapper #130

simonbyrne commented Dec 30, 2023

simonbyrne commented Dec 31, 2023

lxvm commented Dec 31, 2023 •

edited

Loading

yuyichao commented Dec 31, 2023

lxvm commented Dec 31, 2023

simonbyrne commented Dec 31, 2023

simonbyrne commented Dec 31, 2023

yuyichao commented Jan 1, 2024

lxvm commented Jan 3, 2024

simonbyrne commented Jan 3, 2024

yuyichao commented Jan 3, 2024

yuyichao commented Jan 3, 2024

lxvm commented Jan 3, 2024

yuyichao commented Jan 3, 2024

simonbyrne commented Jan 3, 2024

yuyichao commented Jan 3, 2024 •

edited

Loading

lxvm commented Jan 3, 2024

yuyichao commented Jan 3, 2024

lxvm commented Jan 3, 2024 •

edited

Loading

auto wrapper #130

auto wrapper #130

Conversation

simonbyrne commented Dec 30, 2023

simonbyrne commented Dec 31, 2023

lxvm commented Dec 31, 2023 • edited Loading

yuyichao commented Dec 31, 2023

lxvm commented Dec 31, 2023

simonbyrne commented Dec 31, 2023

simonbyrne commented Dec 31, 2023

yuyichao commented Jan 1, 2024

lxvm commented Jan 3, 2024

simonbyrne commented Jan 3, 2024

yuyichao commented Jan 3, 2024

yuyichao commented Jan 3, 2024

lxvm commented Jan 3, 2024

yuyichao commented Jan 3, 2024

simonbyrne commented Jan 3, 2024

yuyichao commented Jan 3, 2024 • edited Loading

lxvm commented Jan 3, 2024

yuyichao commented Jan 3, 2024

lxvm commented Jan 3, 2024 • edited Loading

lxvm commented Dec 31, 2023 •

edited

Loading

yuyichao commented Jan 3, 2024 •

edited

Loading

lxvm commented Jan 3, 2024 •

edited

Loading