
Performance of vectorized evaluations #96

Closed · albop opened this issue Jan 11, 2016 · 20 comments

@albop

albop commented Jan 11, 2016

Out of curiosity, I just ran some performance comparisons between Interpolations.jl and splines.jl. The latter is a small library which does only cubic spline interpolation, in any dimension, with gradient evaluation, and which I updated recently to use Julia macros.
I found it to be much faster than Interpolations.jl when evaluated on a large number of points, even when compared with Interpolations.jl's quadratic interpolation. The code to generate the comparison is here: https://github.com/EconForge/splines.jl/blob/master/test/speed_comparison.jl
and here is the output I get on my machine (50x50x50 grid, evaluation on 100000 points):

Comparison (d=3, K=50, N=100000)
interpolations.jl (quadratic)
  0.101063 seconds (1.10 M allocations: 30.527 MB, 5.21% gc time)
splines.jl: 
  0.012379 seconds (6 allocations: 781.469 KB)
splines.jl (with derivatives): 
  0.026054 seconds (11 allocations: 3.052 MB)

Here are a few remarks/questions coming to my mind:

  • Where do you think the performance gap comes from? In splines.jl, the iteration over the points happens inside the function. Can that explain all the difference?
  • If I understand correctly, the evaluation of the splines at all the points could in principle be vectorized with SIMD instructions for further performance gains, but that doesn't seem to happen (in splines.jl), even using "@fastmath @inbounds @simd". Maybe I'm not doing it right; an equivalent C code was successfully vectorized by the compiler. (A sketch of the kind of loop I have in mind follows this list.)
  • How hard would it be to replicate the functionality of splines.jl in Interpolations.jl (vectorization, multi-splines, gradients)? Would it be easier to merge splines.jl?
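
For concreteness, a minimal sketch of the kind of loop in question (this is not splines.jl's actual code; eval_linear!, the linear kernel, and the clamping are illustrative assumptions):

# Hypothetical sketch: evaluate a 1-D linear interpolant at many points.
# floor(Int, ...) is the step that may prevent auto-vectorization.
function eval_linear!(out::Vector{Float64}, c::Vector{Float64}, x::Vector{Float64})
    @fastmath @inbounds @simd for i in 1:length(x)
        xi = x[i]
        j  = clamp(floor(Int, xi), 1, length(c) - 1)  # bracket index
        t  = xi - j                                   # fractional offset
        out[i] = (1 - t) * c[j] + t * c[j + 1]
    end
    return out
end
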
@sglyon
Member

sglyon commented Jan 11, 2016

Very quick comments:

  • If by multi-splines you mean a potentially different basis across dimensions, this is already here.
  • Gradient is in place for all but cubic.
  • Vector-based evaluation (multiple points at once -- kinda your point 1 above) is implemented for the Gridded variant of splines (linear and quadratic only), but could probably be leveraged in general to avoid a full binary search on each evaluation (see the sketch after this list; SIMD vectorization is a different question, though).
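
To illustrate the binary-search point: when the evaluation points are sorted, the knot bracket can advance monotonically instead of being re-searched per point. A hedged sketch (the name bracket_all and its signature are made up for illustration):

# For sorted evaluation points xs, find the bracketing knot interval for
# each point by sweeping forward instead of a per-point binary search.
function bracket_all(knots::Vector{Float64}, xs::Vector{Float64})
    out = similar(xs, Int)
    j = 1
    for (i, x) in enumerate(xs)
        while j < length(knots) - 1 && knots[j + 1] <= x
            j += 1
        end
        out[i] = j
    end
    return out
end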

@albop
Author

albop commented Jan 11, 2016

Thanks, @spencerlyon2.
By multi-splines, I referred (maybe improperly) to the representation of several vector-valued functions as one "multi-spline". In that case, some time can be saved by computing the coefficients in the blending formulas only once per evaluation point and reusing them for all functions.
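
In code, the idea is something like the following hedged sketch (eval_multi!, the n_funcs × n_knots coefficient layout, and the linear basis are all illustrative assumptions):

# Evaluate nf functions sharing the same knots: the blending weights are
# computed once per point and reused for every row of C.
function eval_multi!(out::Matrix{Float64}, C::Matrix{Float64}, x::Vector{Float64})
    nf = size(C, 1)
    for p in 1:length(x)
        j = clamp(floor(Int, x[p]), 1, size(C, 2) - 1)
        t = x[p] - j              # weights computed once per point...
        w0, w1 = 1 - t, t
        for k in 1:nf             # ...and reused for all nf functions
            out[k, p] = w0 * C[k, j] + w1 * C[k, j + 1]
        end
    end
    return out
end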

@tomasaschan
Contributor

Given the high amount of memory allocation in the test run for Interpolations.jl, I'm suspicious of the splatting you use for the indexing... Will investigate further.

Edit: Splatting isn't the entire problem, but it's a big part of it; exchanging the indexing line for s += +(I...) reduces the memory allocation from 1.1 M (about 30 MB) to 600 k (about 16 MB), indicating that splatting the indices alone is more than half of the memory usage. Since you don't splat your indices when evaluating splines.jl, the comparison isn't entirely fair. I have yet to figure out what the other half of the memory allocations come from, though.

Edit 2: the second half seems to be a type instability in the creation of LB (at least when I run it from JuliaBox). If I change that line to LB = Vector{Float64}[copy(slice(B,i,:)) for i=1:size(B,1)] (note the type annotation at the head of the list comprehension), and the indexing expression to itp[I[1], I[2], I[3]], I get the following results (on JuliaBox):

Comparison (d=3, K=50, N=100000)
interpolations.jl (quadratic)
  0.004493 seconds (5 allocations: 176 bytes)
interpolations.jl (cubic)
  0.007544 seconds (5 allocations: 176 bytes)
splines.jl: 
  0.011770 seconds (6 allocations: 781.469 KB)
splines.jl (with derivatives): 
  0.027235 seconds (11 allocations: 3.052 MB)

(Cubic splines were tested with Flat() instead of Reflect(), because Reflect() is not implemented yet for cubic splines.) I realize that we're no longer using vector-valued evaluation here, but the performance hit you saw was incurred before the calls to Interpolations.jl.

As you can see, Interpolations.jl is now slightly faster than splines.jl.

@tomasaschan
Contributor

I'm also guessing true vector-valued interpolation is actually a little broken at the moment:

itp2 = interpolate(rand(30,40), BSpline(Cubic(Flat())), OnGrid())
I = rand(1.:.1:30., 3, 2)
I = Vector{Float64}[sort(I[:,i]) for i in 1:size(I,2)]
itp2[I...]

WARNING: indexing with non Integer AbstractArrays is deprecated
 in depwarn at deprecated.jl:73
 in to_index at deprecated.jl:453
 in _unsafe_getindex at multidimensional.jl:192
 in getindex at abstractarray.jl:488
 in include_string at loading.jl:266
 in execute_request_0x535c5df2 at /opt/julia_packages/.julia/v0.4/IJulia/src/execute_request.jl:177
 in eventloop at /opt/julia_packages/.julia/v0.4/IJulia/src/IJulia.jl:141
 in anonymous at task.jl:447
while loading In[1], in expression starting on line 5

LoadError: InexactError()
while loading In[1], in expression starting on line 5

 in to_index at deprecated.jl:454
 in _unsafe_getindex at multidimensional.jl:192
 in getindex at abstractarray.jl:488

We probably need to look at that closer... (ref #55)

@albop
Author

albop commented Jan 11, 2016

Ah, excellent! @tlycken Your explanation about the splatting makes perfect sense. I just tried it and got similar results on my laptop.
Now I need to understand why my version doesn't perform better ;-) It performs a few more operations in the loop, like the rescaling of the approximation space, but I have a hard time believing that slows things down so much.
I'm also very curious about the vector-valued part. I took the stance of representing vector-valued functions by 2-d arrays so as to get a precise idea of what low-level code is generated. But I would be bewildered (and pleased) to see that one can get the same level of performance with a more abstract implementation (like replacing scalars by vectors, since all operations are linear with respect to them).

@timholy
Member

timholy commented Jan 11, 2016

For vector-valued interpolation, I recommend FixedSizeArrays. Check out this awe-inspiring result:

using FixedSizeArrays
a = rand(5,200)
b = reinterpret(Vec{5,Float64}, a, (200,))  # view a as 200 points with 5-component values
bitp = interpolate(b, BSpline(Quadratic(Reflect())), OnCell())

function foo(itp, X)
    s = zero(eltype(itp))
    for x in X
        s += itp[x]
    end
    s
end

X = 2+150*rand(10^6)

# After warming up with one call to `foo(bitp, X)`:
julia> @time foo(bitp, X)
  0.027119 seconds (5 allocations: 208 bytes)
FixedSizeArrays.Vec{5,Float64}((466133.70976438577,522291.33201962017,540706.6050416315,468993.60840671643,489241.67073413206))

(The most awe-inspiring part being that it allocated just 208 bytes.)

CCing @SimonDanisch, just for fun.

@tomasaschan
Contributor

Ah, we're talking about different kinds of vector-valued. I was referring to evaluating a scalar-valued interpolant at many places by passing a vector of coordinates; you were talking about evaluating a vector-valued interpolant efficiently. For the latter, FixedSizeArrays is definitely the way to go.

@albop
Author

albop commented Jan 12, 2016

@timholy: very impressive example indeed.
@tlycken: I was thinking about vector-valued interpolants, but this is all very interesting anyway. Thanks.

@timholy
Member

timholy commented Jan 12, 2016

(To be clear, the part I find awe-inspiring is being done by LLVM and julia's code generation/compiler, in figuring out that it can elide the memory allocations needed for all the intermediate computations. Interpolations.jl's and FixedSizeArrays.jl's main contributions are "simply" to leverage that compiler effectively.)

@timholy
Member

timholy commented Jan 12, 2016

Are there any more issues to address, or can this be closed?

@albop
Author

albop commented Jan 12, 2016

@timholy if you prefer to move the discussion somewhere else and close this issue, I'm OK with it. There are still two things I'm curious about. First, the vector (x10) timings below are slower, relative to the scalar ones, than I would expect:

Comparison (d=3, K=50, N=1000000)
interpolations.jl (cubic)
  0.046390 seconds (149 allocations: 10.167 KB)
interpolations.jl: (vector x10)
  0.831623 seconds (5 allocations: 256 bytes)
splines.jl: (scalar)
  0.146415 seconds (6 allocations: 7.630 MB)
splines.jl: (vector x10)
  1.446312 seconds (6 allocations: 76.294 MB, 0.10% gc time)
  1.437065 seconds (4 allocations: 160 bytes)
  • Second, does any of the generated code use SIMD instructions (implicitly or explicitly)? I am trying to add that to splines.jl and will return with another comparison point if it works.

@timholy
Member

timholy commented Jan 12, 2016

If I jump off my previous benchmark, I get a very different result from you:

c = rand(200)
citp = interpolate(c, BSpline(Quadratic(Reflect())), OnCell())

julia> @time foo(citp, X)
  0.016107 seconds (5 allocations: 176 bytes)
485235.8329713323

julia> @time foo(bitp, X)
  0.031591 seconds (5 allocations: 208 bytes)
FixedSizeArrays.Vec{5,Float64}((503188.47280675353,500705.8250749416,482481.291493598,487826.4439053939,500737.8889314253))

Only a 2-fold increase for interpolation with 5-vectors. I haven't looked at your benchmark, though, and am too busy with other things to dig into it now.

Regarding SIMD, I think only if it's automatically added by Julia (which it might be).

@tomasaschan
Contributor

In the script you linked, you are comparing scalar quadratic interpolation with vector-valued cubic :) Fixing that, JuliaBox gives me

Comparison (d=3, K=50, N=1000000)
interpolations.jl (cubic)
  0.075881 seconds (5 allocations: 176 bytes)
interpolations.jl: (vector x10)
  0.838605 seconds (5 allocations: 256 bytes)

i.e. about a factor 11 slowdown, which is about what I'd expect.

I don't think we're using SIMD instructions much, but we gain much of the speed from heavy use of metaprogramming; the compiled indexing expression consists of literally only adds, multiplies, and array lookups from the underlying data array. getindex(itp::Interpolation, args...) is a generated function, and we do as much as we can at compile time.

@timholy
Member

timholy commented Jan 12, 2016

With @code_llvm I see sections like

  %44 = load [4 x double]* %43, align 8
  %45 = extractvalue [4 x double] %44, 0
  %46 = extractvalue [4 x double] %44, 1
  %47 = extractvalue [4 x double] %44, 2
  %48 = extractvalue [4 x double] %44, 3
  %49 = fmul double %45, %33
  %50 = fmul double %46, %33
  %51 = fmul double %47, %33
  %52 = fmul double %48, %33
  %53 = getelementptr inbounds %Vec* %42, i64 %28, i32 0
  %54 = load [4 x double]* %53, align 8
  %55 = extractvalue [4 x double] %54, 0
  %56 = extractvalue [4 x double] %54, 1
  %57 = extractvalue [4 x double] %54, 2
  %58 = extractvalue [4 x double] %54, 3

which makes me think it may be automatically added.

@albop
Author

albop commented Jan 12, 2016

@tlycken and @timholy: point taken; this quadratic-vs-cubic thing has had me confused from the beginning...

As for the gains of using the vector version (apart from conceptual elegance), I would then conclude that it works well in the quadratic case (twice as fast), but not so much in the cubic one. Is that expected?

@albop
Author

albop commented Jan 12, 2016

And before this issue is closed, I have one last question: could you explain very briefly where the performance of Interpolations.jl comes from? @spencerlyon2 explained a bit to me about the precomputation of tensor products, which is something I've also tried to implement, but apparently it was not enough.

@tomasaschan
Contributor

I think the easiest way to showcase why Interpolations.jl is so fast is to introspect the implementation. getindex is implemented like this:

@generated function getindex{T,N}(itp::BSplineInterpolation{T,N}, xs::Number...)
    getindex_impl(itp)
end

where getindex_impl is a function which takes a type and returns an Expr containing the code that will become the body of the function for that type (read up on generated functions if this is confusing to you).
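
To make the mechanism concrete, here is a minimal, standalone generated function (a toy example, not from Interpolations.jl): the body runs once per type signature and returns an expression, which Julia then compiles as the method body.

# Toy @generated function (Julia 0.4 syntax): N is known at compile time,
# so the returned expression is a fully unrolled sum with no loop.
@generated function unrolled_sum{N}(t::NTuple{N,Float64})
    ex = :(t[1])
    for i in 2:N
        ex = :($ex + t[$i])
    end
    ex
end

unrolled_sum((1.0, 2.0, 3.0))  # the compiled body is literally t[1] + t[2] + t[3]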

Take a look at the code we generate:

julia> using Interpolations, Base.Cartesian

julia> macroexpand(Interpolations.getindex_impl(Interpolations.BSplineInterpolation{Float64,1,Array{Float64,1},BSpline{Cubic{Flat}},OnGrid,1}))
quote  # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 22:
    $(Expr(:meta, :inline)) # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 23:
    begin 
        x_1 = xs[1]
    end # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 27:
    begin 
        begin  # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 33:
            ix_1 = clamp(round(Int,real(x_1)),2 - 1,(size(itp,1) + 1) - 2) # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 34:
            fx_1 = x_1 - ix_1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 35:
            ix_1 += 1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 36:
            ixm_1 = ix_1 - 1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 37:
            ixp_1 = ix_1 + 1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 38:
            ixpp_1 = ixp_1 + 1
        end
    end # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 30:
    begin 
        begin  # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 75:
            fx_cub_1 = cub(fx_1) # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 76:
            one_m_fx_cub_1 = cub(1 - fx_1) # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 77:
            cm_1 = SimpleRatio(1,6) * one_m_fx_cub_1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 78:
            c_1 = (SimpleRatio(2,3) - sqr(fx_1)) + SimpleRatio(1,2) * fx_cub_1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 79:
            cp_1 = (SimpleRatio(2,3) - sqr(1 - fx_1)) + SimpleRatio(1,2) * one_m_fx_cub_1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/cubic.jl, line 80:
            cpp_1 = SimpleRatio(1,6) * fx_cub_1
        end
    end # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 33:
    begin 
        $(Expr(:boundscheck, false))
        begin 
            ret = cm_1 * itp.coefs[ixm_1] + c_1 * itp.coefs[ix_1] + cp_1 * itp.coefs[ixp_1] + cpp_1 * itp.coefs[ixpp_1]
            $(Expr(:boundscheck, :(Base.pop)))
        end
    end # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 34:
    ret
end

There are three main blocks of code here:

  1. Create indexes ixm_1, ix_1, ixp_1 and ixpp_1, which denote the four points in the data array that will be considered.
  2. Evaluate the spline coefficients at those points, cm_1, c_1, cp_1 and cpp_1.
  3. Evaluate the result: ret = cm_1 * itp.coefs[ixm_1] + c_1 * itp.coefs[ix_1] + ... (transcribed as a plain function below).
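
Transcribed into ordinary Julia, the generated body amounts to the following (a sketch for illustration: cubic_eval, the explicit n argument, and the one-element padding assumption are mine, and the real generated code uses SimpleRatio rather than floating-point fractions):

# Hand-written equivalent of the generated 1-D cubic body above.
# n is the number of data points; coefs is the prefiltered coefficient
# array, assumed padded with one element on each side.
function cubic_eval(coefs::Vector{Float64}, x::Real, n::Int)
    # (1) indices of the four neighboring coefficients
    ix = clamp(round(Int, x), 1, n - 1)  # nearest data index
    fx = x - ix                          # fractional offset
    ix += 1                              # shift into the padded array
    ixm, ixp = ix - 1, ix + 1
    ixpp = ixp + 1
    # (2) cubic B-spline blending weights
    fx3, omfx3 = fx^3, (1 - fx)^3
    cm  = omfx3 / 6
    c   = 2/3 - fx^2 + fx3 / 2
    cp  = 2/3 - (1 - fx)^2 + omfx3 / 2
    cpp = fx3 / 6
    # (3) blend the four coefficients
    return cm*coefs[ixm] + c*coefs[ix] + cp*coefs[ixp] + cpp*coefs[ixpp]
end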

Now, the beauty comes when we do this in more dimensions. I'll look at linear interpolation here, mainly for brevity, but the same principle applies at higher orders too:

julia> macroexpand(Interpolations.getindex_impl(Interpolations.BSplineInterpolation{Float64,2,Array{Float64,2},BSpline{Linear},OnGrid,1}))
quote  # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 22:
    $(Expr(:meta, :inline)) # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 23:
    begin 
        x_1 = xs[1]
        x_2 = xs[2]
    end # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 27:
    begin 
        begin  # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 6:
            ix_1 = clamp(floor(Int,x_1),1,size(itp,1) - 1) # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 7:
            ixp_1 = ix_1 + 1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 8:
            fx_1 = x_1 - ix_1
        end
        begin  # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 6:
            ix_2 = clamp(floor(Int,x_2),1,size(itp,2) - 1) # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 7:
            ixp_2 = ix_2 + 1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 8:
            fx_2 = x_2 - ix_2
        end
    end # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 30:
    begin 
        begin  # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 15:
            c_1 = 1 - fx_1 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 16:
            cp_1 = fx_1
        end
        begin  # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 15:
            c_2 = 1 - fx_2 # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/linear.jl, line 16:
            cp_2 = fx_2
        end
    end # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 33:
    begin 
        $(Expr(:boundscheck, false))
        begin 
            ret = c_1 * (c_2 * itp.coefs[ix_1,ix_2] + cp_2 * itp.coefs[ix_1,ixp_2]) + cp_1 * (c_2 * itp.coefs[ixp_1,ix_2] + cp_2 * itp.coefs[ixp_1,ixp_2])
            $(Expr(:boundscheck, :(Base.pop)))
        end
    end # /home/tlycken/.julia/v0.4/Interpolations/src/b-splines/indexing.jl, line 34:
    ret
end

Basically the same code, only now it works in two dimensions (the main difference between the linear and cubic cases is the number of coefficients used in the calculation). There is still no cruft, looping, or other machinery for handling interpolation in general; the code is specialized for this exact case. This is probably what Spencer was referring to as precomputation of tensor products.

The speed of Interpolations.jl comes mainly from the fact that we're able to generate specialized code like this, coupled with Julia's ability to very efficiently dispatch to the right version of the code and optimize it aggressively (since it comes out quite simple).

@tomasaschan
Contributor

...and with that short book, I think I will close this :)

@albop
Author

albop commented Jan 12, 2016

Thank you @tlycken! This looks rather close to what I was doing; I'll need to investigate a bit more to fully understand the difference.
As for SIMD instructions, right now explicit vectorization seems to be limited by the fact that the floor instruction doesn't get vectorized.
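
(A hedged aside: floor(Int, x) involves a checked conversion that can block vectorization; for nonnegative in-range values it agrees with unsafe_trunc(Int, x), which skips the check and may let LLVM vectorize. Whether that actually helps splines.jl here is untested. A sketch:)

# Hypothetical workaround: unsafe_trunc skips the checked conversion.
# Valid only when every x[i] is nonnegative and within Int range.
function floors!(out::Vector{Int}, x::Vector{Float64})
    @inbounds @simd for i in 1:length(x)
        out[i] = unsafe_trunc(Int, x[i])  # == floor(Int, x[i]) for x[i] >= 0
    end
    return out
end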

@SimonDanisch

Thanks @timholy, it's nice to see use cases for FixedSizeArrays!
I'm curious whether it will get optimized even better when Julia starts using <N x T> instead of [N x T] :)
