Vectorization macro #43
In my opinion, it is best for this package to be a "boring" backend package that just provides the methods from VML in a simple
I agree. Maybe it is better to transfer this issue to LoopVectorization.
There is the Vectorize.jl package, which used to combine all vectorization libraries and pick the fastest one. As I mentioned in #22, I think IntelVectorMath has reached the limit of its scope. It is very nice and lean now, and does exactly what it says on the tin, fairly robustly.
I prefer LoopVectorization because it already has a Vectorize.jl hasn't been updated for a while and my old issue is still open without any response:
Yes, Vectorize would need to be updated/revived completely. If you want to include AppleAccelerate as mentioned in the OP, I think that would be the way to go.
I've been planning on adding "loop splitting" support in LoopVectorization for a little while now (splitting one loop into several). I would prefer "short vector" functions in general. They wouldn't require any changes to the library to support, nor would they require special casing. E.g., this works well with AVX2:
julia> using LinearAlgebra, LoopVectorization, IntelVectorMath, BenchmarkTools
julia> U = randn(200, 220) |> x -> cholesky(Symmetric(x * x')).U;
julia> function triangle_logdet(A::Union{LowerTriangular,UpperTriangular})
           ld = zero(eltype(A))
           @avx for i in 1:size(A,1)
               ld += log(A[i,i])
           end
           ld
       end
triangle_logdet (generic function with 1 method)
julia> @btime logdet($U)
2.131 μs (0 allocations: 0 bytes)
462.0132368439299
julia> @btime triangle_logdet($U)
1.076 μs (0 allocations: 0 bytes)
462.0132368439296
julia> Float64(sum(log ∘ big, diag(U)))
462.0132368439296
Presumably, VML does not handle vectors with a stride other than 1, which would force me to copy the elements, log them, and then sum them if I wanted to use it there.
julia> y3 = similar(diag(U));
julia> function triangle_logdet_vml!(y, A::Union{LowerTriangular, UpperTriangular})
           @avx for i ∈ 1:size(A,1)
               y[i] = A[i,i]
           end
           IntelVectorMath.log!(y, y)
           ld = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               ld += y[i]
           end
           ld
       end
triangle_logdet_vml! (generic function with 1 method)
julia> @btime triangle_logdet_vml!($y3, $U)
697.691 ns (0 allocations: 0 bytes)
462.0132368439296
It looks like all that effort would pay off, so I'm open to it. Too bad VML isn't more expansive. Adding it wouldn't do much to increase the number of special functions currently supported by SLEEFPirates/LoopVectorization. How well does VML perform on AMD? Is that something I'd have to worry about?
EDIT:
julia> using LinearAlgebra, LoopVectorization, IntelVectorMath, BenchmarkTools
julia> U = randn(200, 220) |> x -> cholesky(Symmetric(x * x')).U;
julia> function triangle_logdet(A::Union{LowerTriangular,UpperTriangular})
           ld = zero(eltype(A))
           @avx for i in 1:size(A,1)
               ld += log(A[i,i])
           end
           ld
       end
triangle_logdet (generic function with 1 method)
julia> @btime logdet($U)
1.426 μs (0 allocations: 0 bytes)
463.5193875385334
julia> @btime triangle_logdet($U)
234.677 ns (0 allocations: 0 bytes)
463.5193875385336
julia> Float64(sum(log ∘ big, diag(U)))
463.51938753853364
julia> y3 = similar(diag(U));
julia> function triangle_logdet_vml!(y, A::Union{LowerTriangular, UpperTriangular})
           @avx for i ∈ 1:size(A,1)
               y[i] = A[i,i]
           end
           IntelVectorMath.log!(y, y)
           ld = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               ld += y[i]
           end
           ld
       end
triangle_logdet_vml! (generic function with 1 method)
julia> @btime triangle_logdet_vml!($y3, $U)
411.110 ns (0 allocations: 0 bytes)
463.51938753853364
With AVX512, it uses this log definition. I'd be more inclined to add something similar for AVX2. For this benchmark, the Intel compilers produce faster code.
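The stride remark above can be made concrete with a small sketch. It uses only the LinearAlgebra standard library and does not call VML; the names `A`, `idx`, and `y` are illustrative and not taken from the benchmark. It shows that the diagonal of a dense column-major matrix is not unit-stride, which is why the elements are first copied into a contiguous buffer before an array-level `log` could be applied.

```julia
using LinearAlgebra

A = rand(4, 4) .+ 1.0      # strictly positive entries, so log is defined
idx = diagind(A)           # linear indices of the diagonal of A

# Consecutive diagonal elements are size(A, 1) + 1 memory slots apart.
diag_step = step(idx)

# Gather the diagonal into a contiguous (stride-1) buffer, which is the
# layout an array-level log routine is assumed to require here.
y = A[idx]

# Log-determinant of the triangular part, computed from the gathered buffer.
ld = sum(log, y)
```

The gather step is the extra work that `triangle_logdet_vml!` pays for and that the single fused `@avx` loop avoids.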
I will have access to an AMD processor on Friday; I will have a look then.
Thank you for this detailed answer! @chriselrod I just wanted to clarify what I mean in this issue, so everyone is on the same page. We can consider 3 kinds of syntax for the macro (I use
1.
a = rand(100)
@ivm sin.(a) .* cos.(a) .* sum.(a)
should be translated to:
IVM.sin(a) .* IVM.cos(a) .* sum.(a)
2.
a = rand(100)
@ivm sin.(a) .* cos.(a)
which, similar to 1, is translated to:
IVM.sin(a) .* IVM.cos(a)
But in this case other functions can use a
3.
a = rand(100)
@ivm sin.(a) .* cos.(a) .* sum.(a)
should be translated to:
out = Vector{eltype(a)}(undef, length(a))
temp = IVM.sin(a) .* IVM.cos(a)
@avx for i=1:length(a)
    out[i] = temp[i] * sum(a[i])
end
out

out = Vector{eltype(a)}(undef, length(a))
@avx for i=1:length(a)
    out[i] = IVM.sin(a[i])[1] * IVM.cos(a[i])[1] * sum(a[i])
end
out
So which syntax do we want to consider?
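To make the first syntax concrete, here is a minimal sketch of a purely syntactic rewrite. Everything in it is an assumption for illustration: `FakeIVM` is a hypothetical stand-in module so the code runs without IntelVectorMath installed, and `@ivm_sketch`, `rewrite_dots`, and `REDIRECTED` are made-up names, not part of any package.

```julia
# Hypothetical stand-in for an array-level backend, defined here only so
# the sketch is self-contained.
module FakeIVM
sin(x::AbstractArray) = map(Base.sin, x)
cos(x::AbstractArray) = map(Base.cos, x)
end

# Function names the sketch redirects (an assumption, not a real list).
const REDIRECTED = (:sin, :cos)

# Leaves of the syntax tree (symbols, literals, ...) are returned unchanged.
rewrite_dots(x) = x

function rewrite_dots(ex::Expr)
    # A dotted call `f.(args...)` parses as Expr(:., :f, Expr(:tuple, args...)).
    if ex.head === :. && length(ex.args) == 2 &&
       ex.args[1] isa Symbol && ex.args[1] in REDIRECTED &&
       ex.args[2] isa Expr && ex.args[2].head === :tuple
        inner = map(rewrite_dots, ex.args[2].args)
        # Build the plain call `FakeIVM.f(args...)`.
        return Expr(:call, Expr(:., :FakeIVM, QuoteNode(ex.args[1])), inner...)
    end
    # Otherwise rebuild the node with its children rewritten.
    return Expr(ex.head, map(rewrite_dots, ex.args)...)
end

macro ivm_sketch(ex)
    esc(rewrite_dots(ex))
end

a = rand(100)
# Expands to FakeIVM.sin(a) .* FakeIVM.cos(a) .* sum.(a)
result = @ivm_sketch sin.(a) .* cos.(a) .* sum.(a)
```

The sketch matches on the bare name `sin` alone, so it cannot tell whether `sin` in the calling scope still refers to `Base.sin`; a qualified call such as `Base.sin.(a)` is left untouched because its callee is not a plain symbol.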
I think this issue can be closed on the basis that advanced macro rewrites of Julia code are likely out of the scope of the package.
I would like to transfer it to LoopVectorization.jl. I don't have access to do that. Maybe @chriselrod can transfer it for me. I think at least the 1st macro can be implemented in this package. It is just a find and replace macro.
No, it isn't really because macros operate on syntax and you don't know if someone has done
When someone uses (@ivm sin.(a).*sin.(b)).*Base.sin.(a)
That is not a good idea because the semantics of broadcasting is to fuse everything into a single kernel.
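The fusion point can be illustrated with a small sketch using only Base. The helpers `vsin` and `vcos` are hypothetical stand-ins for array-level library calls (they are not IntelVectorMath functions); each returns a freshly allocated array, so chaining them cannot fuse into one loop.

```julia
# Hypothetical array-level stand-ins: each call allocates its own result.
vsin(x) = map(sin, x)
vcos(x) = map(cos, x)

# Dot syntax fuses into a single loop writing one output array.
fused(a) = sin.(a) .* cos.(a)

# Array-level calls produce two temporaries, then a third array for the product.
unfused(a) = vsin(a) .* vcos(a)

a = rand(100)
fused(a); unfused(a)                 # warm up so compilation is not measured

bytes_fused = @allocated fused(a)
bytes_unfused = @allocated unfused(a)
```

On a typical run `bytes_unfused` is noticeably larger than `bytes_fused`, since the unfused version materializes the intermediate arrays that broadcasting would otherwise avoid.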
That's why I recommended the 3rd syntax. Actually, I am totally OK with moving this issue to LoopVectorization.
Ok, let's move it there then.
@chriselrod Could you transfer this issue to LoopVectorization? I don't have access.
@aminya I think I'd need committer rights on
I see. I will move it manually then.
It would be nice if we provided a macro that replaces functions with their vectorized versions.
Like
@ivm @. sin(x)
would replace this with the IntelVectorMath function, and
@applacc @. sin(x)
calls AppleAccelerate.
We can provide such macros from IntelVectorMath.jl too, or else maybe have all of them in one place, like inside LoopVectorization.jl.
cc: @chriselrod
Related: #42
Came up in: #22 (comment)