LoopVectorization.jl support #123
Comments
You can't SIMD through that easily. It's a non-static search. I wouldn't expect LoopVectorization to work on that.
Thanks for the quick feedback. I don't have much experience with vectorized code and was already afraid that there wouldn't be an easy solution.
Yeah, hoisting that out of the loop would make the latter code vectorize. That's what would be needed here.
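A minimal sketch of what hoisting the search out of the loop could look like, assuming a plain piecewise-linear interpolant instead of the CubicSpline discussed here. The function name `interp_hoisted!` and the two-pass layout are illustrative assumptions, not code from DataInterpolations.jl. The data-dependent binary searches run first in a scalar pass, so the second loop contains only loads and arithmetic, which is the kind of loop a vectorizer may be able to handle.

```julia
# Hypothetical helper: piecewise-linear interpolation of (t, u) at the points x.
# Assumes t is sorted and has at least two entries.
function interp_hoisted!(y, t, u, x)
    # Pass 1 (scalar): one binary search per query point, hoisted out of the
    # arithmetic loop. The index is clamped so that i and i + 1 are both valid.
    idx = [clamp(searchsortedlast(t, xi), firstindex(t), lastindex(t) - 1) for xi in x]

    # Pass 2: no searches and no branches, only indexed loads and arithmetic.
    @inbounds @simd for k in eachindex(x, y, idx)
        i = idx[k]
        w = (x[k] - t[i]) / (t[i + 1] - t[i])
        y[k] = (1 - w) * u[i] + w * u[i + 1]
    end
    return y
end

t = [0.0, 1.0, 2.0]
u = [0.0, 10.0, 30.0]
x = [0.5, 1.5, 2.0]
y = interp_hoisted!(similar(x), t, u, x)
```

The trade-off is an extra index array and a second pass over the data. Whether the second loop actually vectorizes depends on the compiler and the target CPU.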
I could probably write SIMD versions.
I'd say this would be a good first issue for someone else to try. When trying to write a SIMD
But
You can define SIMD versions of functions like
E.g., could define a working
It'll be a while before the LLVM version can do that automatically, but that's the dream of course!
I decided to give it a shot and implement a SIMD version.

import Base.searchsortedlast
import VectorizationBase.AbstractSIMDVector
using LoopVectorization: vany, ifelse, Vec

Base.Sort.midpoint(lo::AbstractSIMDVector{W,I}, hi::AbstractSIMDVector{W,I}) where {W,I<:Integer} = lo + ((hi - lo) >>> 0x01)
Base.getindex(A::AbstractArray, i::Vec{W}) where W = Vec{W,eltype(A)}(getindex(A,[Tuple(i)...])...)

function searchsortedlast(v::AbstractVector, x::AbstractSIMDVector{W,I}, lo::T, hi::T, o::Base.Ordering) where {W,I,T<:Integer}
    u = Vec{W,T}(1)       # vector of ones, used to widen the bounds per lane
    lo = lo - u
    hi = hi + u
    st = lo < hi - u      # per-lane mask: lanes whose search is still running
    @inbounds while vany(st)  # loop until every lane has converged
        m = Base.Sort.midpoint(lo, hi)
        b = (x < v[m]) & st   # running lanes whose target lies left of m
        hi = ifelse(b, m, hi)
        lo = ifelse(b, lo, m)
        st = lo < hi - u
    end
    return lo
end

For my small toy example from the beginning, this gives the correct result:

using DataInterpolations
using LoopVectorization

u = [14.7, 11.51, 10.41, 14.95, 12.24, 11.22]
t = [0.0, 62.25, 109.66, 162.66, 205.8, 252.3]
A = CubicSpline(u,t)

function simd_test(A)
    f = x -> A(x)
    x_arr = collect(range(10.5,14.0, 100))
    y_arr = similar(x_arr)
    y_simd_arr = similar(x_arr)
    # scalar reference evaluation
    for i in eachindex(x_arr)
        y_arr[i] = A(x_arr[i])
    end
    # vectorized evaluation
    @turbo for i in eachindex(x_arr)
        y_simd_arr[i] = A(x_arr[i])
    end
    return y_arr ≈ y_simd_arr
end

@assert simd_test(A) # --> true

The solution I came up with is quite rough and needs some more thought put in to work with all

Base.getindex(A::AbstractArray, i::AbstractSIMDVector{W}) where W = AbstractSIMDVector{W,eltype(A)}(getindex(A,[Tuple(i)...])...)
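To make the masked binary search above easier to follow, here is a sketch that emulates the SIMD lanes with ordinary arrays and broadcasting, using only Base Julia. The name `lanewise_searchsortedlast` is hypothetical, and the `clamp` on the load index is an addition of this sketch: a lane that has already converged at `lo == 0` would otherwise index `v[0]`.

```julia
# Emulates the per-lane search with plain Vectors; each entry of x is one "lane".
function lanewise_searchsortedlast(v::AbstractVector, x::AbstractVector)
    lo = fill(firstindex(v) - 1, length(x))
    hi = fill(lastindex(v) + 1, length(x))
    active = lo .< hi .- 1                      # mask of lanes still searching
    while any(active)                           # plays the role of vany(st)
        m = lo .+ ((hi .- lo) .>>> 1)           # per-lane midpoint
        # Converged lanes may sit at an out-of-range index, so clamp the load.
        mc = clamp.(m, firstindex(v), lastindex(v))
        b = (x .< v[mc]) .& active              # active lanes that must go left
        hi = ifelse.(b, m, hi)
        lo = ifelse.(b, lo, m)
        active = lo .< hi .- 1
    end
    return lo
end

t = [0.0, 62.25, 109.66, 162.66, 205.8, 252.3]
queries = [-1.0, 0.0, 100.0, 252.3, 300.0]
result = lanewise_searchsortedlast(t, queries)
```

All lanes keep iterating until the slowest one finishes, which is why the loop condition is "any lane active" rather than a per-element branch.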
@inline function Base.getindex(A::AbstractArray, i::AbstractSIMD...)
    VectorizationBase.vload(stridedpointer(A), i)
end

This should be much faster than your current version. Because your current version is more specific, it would be preferred by dispatch, so make sure to either start a new Julia session or define a

I'd once thought about adding methods like this to

@inline function Base.getindex(A::AbstractArray, j::AbstractSIMD, i::Union{Integer,AbstractSIMD}...)
    VectorizationBase.vload(stridedpointer(A), (j,i...))
end
@inline function Base.getindex(A::AbstractArray, k::Integer, j::AbstractSIMD, i::Union{Integer,AbstractSIMD}...)
    VectorizationBase.vload(stridedpointer(A), (k,j,i...))
end

etc to support using
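The dispatch remark above can be illustrated with a small, self-contained sketch. The types `AbstractLanes` and `Lanes` and the function `pick` are hypothetical stand-ins for `AbstractSIMD`, `Vec`, and `getindex`. They only show that a single-argument method on a concrete type is preferred over a varargs method on its abstract supertype, so an older, more specific definition keeps being called while it exists in the session.

```julia
# Hypothetical stand-ins for the SIMD types discussed in this thread.
abstract type AbstractLanes end
struct Lanes <: AbstractLanes
    data::NTuple{4,Int}
end

# Generic varargs method, analogous to the getindex(A, i::AbstractSIMD...) form.
pick(A::AbstractArray, i::AbstractLanes...) = :generic_varargs

# More specific single-index method, analogous to the earlier Vec-based getindex.
pick(A::AbstractArray, i::Lanes) = :specific_single

lanes = Lanes((1, 2, 3, 1))
one_index = pick([1, 2, 3], lanes)           # dispatches to the specific method
two_indices = pick([1 2; 3 4], lanes, lanes) # only the varargs method applies
```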
With your version of getindex the code is actually type stable and my radiance calculation as a whole is around 2x faster than the previous unvectorized version (judging from the profiler this seems to be near optimal). The remaining question for me is whether you want any of that upstream. I think there are multiple options:
Option 2 would avoid ambiguities with getindex, but would require more code. What do you think? Anyway, I'm pretty happy with the result and achieved my goal of speeding up the previous numpy version significantly (20x - 24x for my benchmarks). Thanks for guiding me through this.
I think "1." is good. We don't need all Although we may also prefer calling
JuliaSIMD/VectorizationBase.jl#90 was merged. Thanks again for your help!
Hi,
while implementing a solution to a radiative transfer problem, I encountered a problem when trying to use a CubicSpline with SIMD instructions.
Basically, I have an integration routine using the @turbo macro from LoopVectorization.jl, where the integrand contains an interpolator from DataInterpolations.jl.
Here is a small example reproducing the problem:
Here is the output:
The vectorization fails in searchsortedlast(...) when calling isless with vectorized arguments.
I tried to define an adapted version of searchsortedlast, but to no avail so far.
It would be great if you could share your opinion on how to approach or circumvent this problem.
Judging from my benchmarks with non-interpolated functions, I would get a massive 3x speedup from using LoopVectorization.jl.