
LoopVectorization.jl support #123

Closed
maj0e opened this issue Aug 4, 2022 · 13 comments

@maj0e

maj0e commented Aug 4, 2022

Hi,

while implementing a solution to a radiative transfer problem, I ran into an issue when trying to use a CubicSpline with SIMD instructions.
Basically, I have an integration routine using the @turbo macro from LoopVectorization.jl, where the integrand contains an interpolator from DataInterpolations.jl.

Here is a small example reproducing the problem:

using DataInterpolations
using LoopVectorization

u = [14.7, 11.51, 10.41, 14.95, 12.24, 11.22]
t = [0.0, 62.25, 109.66, 162.66, 205.8, 252.3]

A = CubicSpline(u,t)

function simd_example(A)
    x_arr = collect(range(10.5, 14.0, 100))
    y_arr = zero(x_arr)
    # evaluating the interpolator inside the @turbo loop triggers the error below
    @turbo for i in eachindex(x_arr)
        y_arr[i] = A(x_arr[i])
    end
    return y_arr
end

simd_example(A)

Here is the output:

ERROR: TypeError: non-boolean (Mask{4, UInt8}) used in boolean context
Stacktrace:
  [1] signless(x::MM{4, 1, Int64}, y::Float64)
    @ Base ./operators.jl:143
  [2] isless(x::MM{4, 1, Int64}, y::Float64)
    @ Base ./operators.jl:185
  [3] searchsortedlast
    @ ~/Schreibtisch/Test_simd.jl:19 [inlined]
  [4] searchsortedlast
    @ ./sort.jl:295 [inlined]
  [5] #searchsortedlast#5
    @ ./sort.jl:297 [inlined]
  [6] searchsortedlast
    @ ./sort.jl:297 [inlined]
  [7] _interpolate(A::CubicSpline{Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, true, Float64}, t::MM{4, 1, Int64})
    @ DataInterpolations ~/.julia/packages/DataInterpolations/Al4Ib/src/interpolation_methods.jl:146
  [8] AbstractInterpolation
    @ ~/.julia/packages/DataInterpolations/Al4Ib/src/DataInterpolations.jl:37 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/LoopVectorization/ORJxS/src/reconstruct_loopset.jl:947 [inlined]
 [10] _turbo_!
    @ ~/.julia/packages/LoopVectorization/ORJxS/src/reconstruct_loopset.jl:947 [inlined]

The vectorization fails in searchsortedlast(...) when isless is called with vectorized arguments.
I tried to define an adapted version of searchsortedlast, but to no avail so far.

It would be great if you could share your opinion on how to approach or circumvent this problem.
Judging from my benchmarks with non-interpolated functions, I would get a massive 3x speedup from using LoopVectorization.jl.

@ChrisRackauckas
Member

You can't SIMD through that easily. It's a non-static search. I wouldn't expect LoopVectorization to work on that.

@maj0e
Author

maj0e commented Aug 4, 2022

Thanks for the quick feedback. I don't have much experience with vectorized code and was already afraid there wouldn't be an easy solution.
I will probably reformulate my solution, then, to evaluate the interpolator before the vectorized loop.

@ChrisRackauckas
Member

Yeah, hoisting that out of the loop would make the remaining code vectorize. That's what would be needed here.
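For illustration, a minimal sketch of that hoisting approach (the function name hoisted_example and the squared integrand are made-up placeholders, not from this thread): evaluate the interpolator in an ordinary scalar loop first, then apply @turbo only to the remaining arithmetic on the precomputed values.

using DataInterpolations, LoopVectorization

# A: a DataInterpolations interpolator; x_arr: evaluation points.
# The interpolation runs in a plain loop (broadcast); only the
# arithmetic that follows is vectorized with @turbo.
function hoisted_example(A, x_arr)
    u_arr = A.(x_arr)                # scalar interpolation, hoisted out of the SIMD loop
    y_arr = similar(u_arr)
    @turbo for i in eachindex(u_arr)
        y_arr[i] = u_arr[i]^2        # stand-in for the rest of the integrand
    end
    return y_arr
end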

@chriselrod
Contributor

I could probably write SIMD versions.

@chriselrod
Contributor

chriselrod commented Aug 4, 2022

I'd say this would be a good first issue for someone else to try.

When trying to write a SIMD searchsortedlast, I'd recommend looking at the SIMD gcd implementation as inspiration: https://github.com/JuliaSIMD/VectorizationBase.jl/blob/25df93dfe33e717715dd1cde03546769490557de/src/special/misc.jl#L160-L184
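For readers unfamiliar with that style, the linked gcd follows a masked-iteration pattern: every lane keeps looping until all lanes are done, and ifelse masks prevent finished lanes from being updated. A toy sketch of the pattern (function and variable names are made up for illustration; it relies on the vany, ifelse, and Vec imports used later in this thread, not on any DataInterpolations code):

using LoopVectorization: vany, ifelse, Vec

# Count, per lane, how many halvings it takes to bring each value below 1.
# Lanes finish at different times; the `active` mask tracks unfinished lanes.
function halving_steps(x::Vec{W,Float64}) where {W}
    count = Vec{W,Int}(0)
    active = x >= 1.0                        # per-lane "still working" mask
    while vany(active)                       # loop until every lane is done
        x     = ifelse(active, 0.5 * x, x)   # update only active lanes
        count = ifelse(active, count + 1, count)
        active = x >= 1.0
    end
    return count
end

halving_steps(Vec(1.0, 3.0, 9.0, 100.0))     # lanes converge after 1, 2, 4, and 7 steps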

@ChrisRackauckas
Member

But @turbo definitely won't do that automatically (unless your LLVM version is getting really good 😅)

@chriselrod
Contributor

chriselrod commented Aug 4, 2022

> But @turbo definitely won't do that automatically (unless your LLVM version is getting really good 😅)

You can define SIMD versions of functions like searchsortedlast to prevent method errors.

@turbo abuses multiple dispatch, which hurts compile times, but at least makes it easy to plug in your own methods.

ERROR: TypeError: non-boolean (Mask{4, UInt8}) used in boolean context
Stacktrace:
  ...
  [3] searchsortedlast
    @ ~/Schreibtisch/Test_simd.jl:19 [inlined]
  ...
  [7] _interpolate(A::CubicSpline{Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, true, Float64}, t::MM{4, 1, Int64})
    @ DataInterpolations ~/.julia/packages/DataInterpolations/Al4Ib/src/interpolation_methods.jl:146
  ...

E.g., one could define a working _interpolate method.
That may also involve defining methods like searchsortedlast that work for the SIMD types.

@chriselrod
Contributor

It'll be a while before the LLVM version can do that automatically, but that's the dream of course!
Automatic is better than "easy to plug in".

@maj0e
Author

maj0e commented Aug 5, 2022

I decided to give it a shot and implement a SIMD version.
For DataInterpolations.CubicSpline to work inside a @turbo loop, it was enough to write vectorized versions of searchsortedlast and getindex as follows:

import Base.searchsortedlast
import VectorizationBase.AbstractSIMDVector
using LoopVectorization: vany, ifelse, Vec

# SIMD counterpart of Base.Sort.midpoint for vectors of indices
Base.Sort.midpoint(lo::AbstractSIMDVector{W,I}, hi::AbstractSIMDVector{W,I}) where {W,I<:Integer} = lo + ((hi - lo) >>> 0x01)

# gather: index an array with a SIMD vector of indices
Base.getindex(A::AbstractArray, i::Vec{W}) where W = Vec{W,eltype(A)}(getindex(A, [Tuple(i)...])...)

# vectorized binary search: all lanes keep iterating until every lane has
# converged; the `st` mask tracks which lanes are still searching
function searchsortedlast(v::AbstractVector, x::AbstractSIMDVector{W,I}, lo::T, hi::T, o::Base.Ordering) where {W,I,T<:Integer}
    u = Vec{W,T}(1)
    lo = lo - u
    hi = hi + u
    st = lo < hi - u
    @inbounds while vany(st)
        m = Base.Sort.midpoint(lo, hi)
        b = (x < v[m]) & st
        hi = ifelse(b, m, hi)
        lo = ifelse(b, lo, m)
        st = lo < hi - u
    end
    return lo
end

For my small toy example from the beginning, this gives the correct result:

using DataInterpolations
using LoopVectorization
u = [14.7, 11.51, 10.41, 14.95, 12.24, 11.22]
t = [0.0, 62.25, 109.66, 162.66, 205.8, 252.3]
A = CubicSpline(u,t)

function simd_test(A)
    x_arr = collect(range(10.5,14.0, 100))
    y_arr = similar(x_arr)
    y_simd_arr = similar(x_arr)

    for i in eachindex(x_arr)
        y_arr[i] = A(x_arr[i])
    end

    @turbo for i in eachindex(x_arr)
        y_simd_arr[i] = A(x_arr[i])
    end
    return y_arr ≈ y_simd_arr
end
@assert simd_test(A) # --> true

The solution I came up with is quite rough and needs some more thought to work with all AbstractSIMDVector types.
getindex(), for example, should take an AbstractSIMDVector{W} instead of a Vec{W}, but I haven't found a way to make that type stable yet:

Base.getindex(A::AbstractArray, i::AbstractSIMDVector{W}) where W = AbstractSIMDVector{W,eltype(A)}(getindex(A,[Tuple(i)...])...)

@chriselrod
Contributor

chriselrod commented Aug 5, 2022

@inline function Base.getindex(A::AbstractArray, i::AbstractSIMD...)
    VectorizationBase.vload(stridedpointer(A), i)
end

This should be much faster than your current version.

Because your current version is more specific, it would be preferred by dispatch, so make sure to either start a new Julia session or define a ::Vec version here as well.
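A minimal sketch of that ::Vec version, assuming the vload-based definition above (it simply forwards a single SIMD index as a one-element tuple, so the fast path is taken instead of the earlier collect-based getindex):

import VectorizationBase

# Forward a single SIMD index to vload, mirroring the AbstractSIMD... method above.
@inline function Base.getindex(A::AbstractArray, i::VectorizationBase.Vec)
    VectorizationBase.vload(VectorizationBase.stridedpointer(A), (i,))
end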

I'd once thought about adding methods like this to VectorizationBase. Not sure why I didn't.
IIRC, it was annoying because of ambiguities.
E.g., do I also have to add

@inline function Base.getindex(A::AbstractArray, j::AbstractSIMD, i::Union{Integer,AbstractSIMD}...)
    VectorizationBase.vload(stridedpointer(A), (j,i...))
end
@inline function Base.getindex(A::AbstractArray, k::Integer, j::AbstractSIMD, i::Union{Integer,AbstractSIMD}...)
    VectorizationBase.vload(stridedpointer(A), (k,j,i...))
end

etc., to support using Integers as well as AbstractSIMD?

@maj0e
Author

maj0e commented Aug 6, 2022

With your version of getindex, the code is type stable, and my radiance calculation as a whole is around 2x faster than the previous unvectorized version (judging from the profiler, this seems to be near optimal).

The remaining question for me is whether you want any of this upstream.

I think there are multiple options:

  1. getindex and searchsortedlast in VectorizationBase
  2. searchsortedlast in VectorizationBase calling vload directly, plus a vectorized version of _interpolate in DataInterpolations
  3. None of that, and I keep the vectorized version just for myself.

Option 2 would avoid ambiguities with getindex, but would require more code. What do you think?

Anyway, I'm pretty happy with the result and achieved my goal of speeding up the previous NumPy version significantly (20x-24x in my benchmarks). Thanks for guiding me through this.

@chriselrod
Contributor

chriselrod commented Aug 6, 2022

20x is something I like to hear!

I think "1." is good. We don't need all getindex to work.

Although we may also prefer calling vload directly.
Also, I'd mark everything as @inline, since the compiler's inlining cost model considers llvmcall much more expensive than most of these intrinsics actually are.

@maj0e
Author

maj0e commented Aug 12, 2022

JuliaSIMD/VectorizationBase.jl#90 was merged. Thanks again for your help!
