
LoopVectorization.jl support #123

Closed
maj0e opened this issue Aug 4, 2022 · 13 comments

@maj0e

maj0e commented Aug 4, 2022

Hi,

while implementing a solution to a radiative transfer problem, I ran into an issue when trying to use a CubicSpline with SIMD instructions.
Basically, I have an integration routine using the @turbo macro from LoopVectorization.jl, where the integrand contains an interpolator from DataInterpolations.jl.

Here is a small example reproducing the problem:

using DataInterpolations
using LoopVectorization

u = [14.7, 11.51, 10.41, 14.95, 12.24, 11.22]
t = [0.0, 62.25, 109.66, 162.66, 205.8, 252.3]

A = CubicSpline(u,t)

function simd_example(A)
    x_arr = collect(range(10.5, 14.0, 100))
    y_arr = zero(x_arr)
    # evaluating the interpolator inside the @turbo loop triggers the error below
    @turbo for i in eachindex(x_arr)
        y_arr[i] = A(x_arr[i])
    end
    return y_arr
end

simd_example(A)

Here is the output:

ERROR: TypeError: non-boolean (Mask{4, UInt8}) used in boolean context
Stacktrace:
  [1] signless(x::MM{4, 1, Int64}, y::Float64)
    @ Base ./operators.jl:143
  [2] isless(x::MM{4, 1, Int64}, y::Float64)
    @ Base ./operators.jl:185
  [3] searchsortedlast
    @ ~/Schreibtisch/Test_simd.jl:19 [inlined]
  [4] searchsortedlast
    @ ./sort.jl:295 [inlined]
  [5] #searchsortedlast#5
    @ ./sort.jl:297 [inlined]
  [6] searchsortedlast
    @ ./sort.jl:297 [inlined]
  [7] _interpolate(A::CubicSpline{Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, true, Float64}, t::MM{4, 1, Int64})
    @ DataInterpolations ~/.julia/packages/DataInterpolations/Al4Ib/src/interpolation_methods.jl:146
  [8] AbstractInterpolation
    @ ~/.julia/packages/DataInterpolations/Al4Ib/src/DataInterpolations.jl:37 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/LoopVectorization/ORJxS/src/reconstruct_loopset.jl:947 [inlined]
 [10] _turbo_!
    @ ~/.julia/packages/LoopVectorization/ORJxS/src/reconstruct_loopset.jl:947 [inlined]

The vectorization fails in searchsortedlast(...) when isless is called with vectorized arguments.
I tried to define an adapted version of searchsortedlast, but to no avail so far.

It would be great if you could share your opinion on how to approach or circumvent this problem.
Judging from my benchmarks with non-interpolated functions, I would get a massive 3x speedup from using LoopVectorization.jl.

@ChrisRackauckas
Member

You can't SIMD through that easily. It's a non-static search. I wouldn't expect LoopVectorization to work on that.

@maj0e
Author

maj0e commented Aug 4, 2022

Thanks for the quick feedback. I don't have much experience with vectorized code and was already afraid there wouldn't be an easy solution.
I will probably reformulate my solution, then, to evaluate the interpolator before the vectorized loop.

@ChrisRackauckas
Member

Yeah, hoisting that out of the loop would make the remaining code vectorize. That's what would be needed here.
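For illustration, a minimal sketch of that hoisting approach (the function name hoisted_example and the squared integrand are made-up placeholders, not from this thread): evaluate the interpolator in an ordinary scalar loop first, then apply @turbo only to the remaining arithmetic on the precomputed values.

using DataInterpolations, LoopVectorization

# A: a DataInterpolations interpolator; x_arr: evaluation points.
# The interpolation runs in a plain loop (broadcast); only the
# arithmetic that follows is vectorized with @turbo.
function hoisted_example(A, x_arr)
    u_arr = A.(x_arr)                # scalar interpolation, hoisted out of the SIMD loop
    y_arr = similar(u_arr)
    @turbo for i in eachindex(u_arr)
        y_arr[i] = u_arr[i]^2        # stand-in for the rest of the integrand
    end
    return y_arr
end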

@chriselrod
Contributor

I could probably write SIMD versions.

@chriselrod
Contributor

chriselrod commented Aug 4, 2022

I'd say this would be a good first issue for someone else to try.

When trying to write a SIMD searchsortedlast, I'd recommend looking at the SIMD gcd implementation as inspiration: https://github.com/JuliaSIMD/VectorizationBase.jl/blob/25df93dfe33e717715dd1cde03546769490557de/src/special/misc.jl#L160-L184
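For readers unfamiliar with that style, the linked gcd follows a masked-iteration pattern: every lane keeps looping until all lanes are done, and ifelse masks prevent finished lanes from being updated. A toy sketch of the pattern (function and variable names are made up for illustration; it relies on the vany, ifelse, and Vec imports used later in this thread, not on any DataInterpolations code):

using LoopVectorization: vany, ifelse, Vec

# Count, per lane, how many halvings it takes to bring each value below 1.
# Lanes finish at different times; the `active` mask tracks unfinished lanes.
function halving_steps(x::Vec{W,Float64}) where {W}
    count = Vec{W,Int}(0)
    active = x >= 1.0                        # per-lane "still working" mask
    while vany(active)                       # loop until every lane is done
        x     = ifelse(active, 0.5 * x, x)   # update only active lanes
        count = ifelse(active, count + 1, count)
        active = x >= 1.0
    end
    return count
end

halving_steps(Vec(1.0, 3.0, 9.0, 100.0))     # lanes converge after 1, 2, 4, and 7 steps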

@ChrisRackauckas
Member

But @turbo definitely won't do that automatically (unless your LLVM version is getting really good 😅)

@chriselrod
Contributor

chriselrod commented Aug 4, 2022

> But @turbo definitely won't do that automatically (unless your LLVM version is getting really good 😅)

You can define SIMD versions of functions like searchsortedlast to prevent method errors.

@turbo abuses multiple dispatch, which hurts compile times, but at least makes it easy to plug in your own methods.

ERROR: TypeError: non-boolean (Mask{4, UInt8}) used in boolean context
Stacktrace:
  ...
  [3] searchsortedlast
    @ ~/Schreibtisch/Test_simd.jl:19 [inlined]
  ...
  [7] _interpolate(A::CubicSpline{Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, true, Float64}, t::MM{4, 1, Int64})
    @ DataInterpolations ~/.julia/packages/DataInterpolations/Al4Ib/src/interpolation_methods.jl:146
  ...

E.g., one could define a working _interpolate method.
That may also involve defining methods like searchsortedlast that work for the SIMD types.

@chriselrod
Contributor

It'll be a while before the LLVM version can do that automatically, but that's the dream of course!
Automatic is better than "easy to plug in".

@maj0e
Author

maj0e commented Aug 5, 2022

I decided to give it a shot and implement a SIMD version.
For DataInterpolations.CubicSpline to work inside a @turbo loop, it was enough to write vectorized versions of searchsortedlast and getindex as follows:

import Base.searchsortedlast
import VectorizationBase.AbstractSIMDVector
using LoopVectorization: vany, ifelse, Vec

# SIMD counterpart of Base.Sort.midpoint for vectors of indices
Base.Sort.midpoint(lo::AbstractSIMDVector{W,I}, hi::AbstractSIMDVector{W,I}) where {W,I<:Integer} = lo + ((hi - lo) >>> 0x01)

# gather: index an array with a SIMD vector of indices
Base.getindex(A::AbstractArray, i::Vec{W}) where W = Vec{W,eltype(A)}(getindex(A, [Tuple(i)...])...)

# vectorized binary search: all lanes keep iterating until every lane has
# converged; the `st` mask tracks which lanes are still searching
function searchsortedlast(v::AbstractVector, x::AbstractSIMDVector{W,I}, lo::T, hi::T, o::Base.Ordering) where {W,I,T<:Integer}
    u = Vec{W,T}(1)
    lo = lo - u
    hi = hi + u
    st = lo < hi - u
    @inbounds while vany(st)
        m = Base.Sort.midpoint(lo, hi)
        b = (x < v[m]) & st
        hi = ifelse(b, m, hi)
        lo = ifelse(b, lo, m)
        st = lo < hi - u
    end
    return lo
end

For my small toy example from the beginning, this gives the correct result:

using DataInterpolations
using LoopVectorization
u = [14.7, 11.51, 10.41, 14.95, 12.24, 11.22]
t = [0.0, 62.25, 109.66, 162.66, 205.8, 252.3]
A = CubicSpline(u,t)

function simd_test(A)
    x_arr = collect(range(10.5,14.0, 100))
    y_arr = similar(x_arr)
    y_simd_arr = similar(x_arr)

    for i in eachindex(x_arr)
        y_arr[i] = A(x_arr[i])
    end

    @turbo for i in eachindex(x_arr)
        y_simd_arr[i] = A(x_arr[i])
    end
    return y_arr ≈ y_simd_arr
end
@assert simd_test(A) # --> true

The solution I came up with is quite rough and needs some more thought to work with all AbstractSIMDVector types.
getindex(), for example, should take an AbstractSIMDVector{W} instead of a Vec{W}, but I haven't found a way to make that type stable yet:

Base.getindex(A::AbstractArray, i::AbstractSIMDVector{W}) where W = AbstractSIMDVector{W,eltype(A)}(getindex(A,[Tuple(i)...])...)

@chriselrod
Contributor

chriselrod commented Aug 5, 2022

@inline function Base.getindex(A::AbstractArray, i::AbstractSIMD...)
    VectorizationBase.vload(stridedpointer(A), i)
end

This should be much faster than your current version.

Because your current version is more specific, it would be preferred by dispatch, so make sure to either start a new Julia session or define a ::Vec version here as well.
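A minimal sketch of that ::Vec version, assuming the vload-based definition above (it simply forwards a single SIMD index as a one-element tuple, so the fast path is taken instead of the earlier collect-based getindex):

import VectorizationBase

# Forward a single SIMD index to vload, mirroring the AbstractSIMD... method above.
@inline function Base.getindex(A::AbstractArray, i::VectorizationBase.Vec)
    VectorizationBase.vload(VectorizationBase.stridedpointer(A), (i,))
end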

I'd once thought about adding methods like this to VectorizationBase. Not sure why I didn't.
IIRC, it was annoying because of ambiguities.
E.g., do I also have to add

@inline function Base.getindex(A::AbstractArray, j::AbstractSIMD, i::Union{Integer,AbstractSIMD}...)
    VectorizationBase.vload(stridedpointer(A), (j,i...))
end
@inline function Base.getindex(A::AbstractArray, k::Integer, j::AbstractSIMD, i::Union{Integer,AbstractSIMD}...)
    VectorizationBase.vload(stridedpointer(A), (k,j,i...))
end

etc., to support using Integers as well as AbstractSIMD?

@maj0e
Author

maj0e commented Aug 6, 2022

With your version of getindex, the code is type stable, and my radiance calculation as a whole is around 2x faster than the previous unvectorized version (judging from the profiler, this seems to be near optimal).

The remaining question for me is whether you want any of this upstream.

I think there are multiple options:

  1. getindex and searchsortedlast in VectorizationBase
  2. searchsortedlast in VectorizationBase calling vload directly, plus a vectorized version of _interpolate in DataInterpolations
  3. None of that, and I keep the vectorized version just for myself.

Option 2 would avoid ambiguities with getindex, but would require more code. What do you think?

Anyway, I'm pretty happy with the result and achieved my goal of speeding up the previous NumPy version significantly (20x-24x in my benchmarks). Thanks for guiding me through this.

@chriselrod
Contributor

chriselrod commented Aug 6, 2022

20x is something I like to hear!

I think "1." is good. We don't need all getindex to work.

Although we may also prefer calling vload directly.
Also, I'd mark everything as @inline, since the compiler's inlining cost model considers llvmcall much more expensive than most of these intrinsics actually are.

@maj0e
Author

maj0e commented Aug 12, 2022

JuliaSIMD/VectorizationBase.jl#90 was merged. Thanks again for your help!
