Vectorization macro #43

Closed
aminya opened this issue Mar 2, 2020 · 18 comments

@aminya
Collaborator

aminya commented Mar 2, 2020

It would be nice if we provided a macro that replaces functions with their vectorized versions.

For example, @ivm @. sin(x) would replace the call with the IntelVectorMath function, and @applacc @. sin(x) would call AppleAccelerate.

We could provide such macros from IntelVectorMath.jl, or else keep all of them in one place, such as inside LoopVectorization.jl.
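
For illustration, this is the kind of rewrite the proposed macros would perform (@ivm and @applacc are proposed names and don't exist yet; the expanded IntelVectorMath call below does):

using IntelVectorMath  # exports the IVM alias

x = rand(1000)

# @ivm @. sin(x) would expand to a single vectorized library call:
IntelVectorMath.sin(x)
# and @applacc @. sin(x) would analogously expand to AppleAccelerate.sin(x)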

cc: @chriselrod

The major improvement these libraries provide is that they're vectorized. If x is a scalar, there isn't much benefit, if any at all.
An earlier version of LoopVectorization provided an @vectorize macro (since removed) that naively swapped calls and incremented the loops (i.e., instead of 1:N, it would iterate 1:W:N, plus some code to handle the remainder). @avx does this better.

If x is a vector, calling @avx sin.(x) or IntelVectorMath.sin(x) works (although a macro could search a whole block of code and swap calls to use IntelVectorMath).
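
For illustration, here is a hand-written version of that incremented-loop pattern (a sketch of the idea only, not the actual @vectorize output; sum_sin_strided and W are made up for this example):

function sum_sin_strided(x::Vector{Float64}, W::Int = 8)
    N = length(x)
    s = 0.0
    i = 1
    @inbounds while i + W - 1 <= N    # main loop: W elements per iteration
        @simd for j in i:(i + W - 1)  # stands in for one W-wide vector call
            s += sin(x[j])
        end
        i += W
    end
    @inbounds while i <= N            # scalar loop for the remainder
        s += sin(x[i])
        i += 1
    end
    s
end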

Related: #42
Came up in: #22 (comment)

@KristofferC
Member

In my opinion, it is best for this package to be a "boring" backend package that just provides the methods from VML in a simple VML.function(::Vector) style. Packages that want to do fancy optimizations (like LoopVectorization) can implement whatever macros or tricks they want on top of it.
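
In that "boring" style, usage stays plain method calls (a minimal sketch; IVM is the alias IntelVectorMath exports, and log! is the mutating variant used later in this thread):

using IntelVectorMath

x = rand(1000)
y = IVM.exp(x)              # out-of-place method on a Vector
IntelVectorMath.log!(y, y)  # in-place variant, writing into y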

@aminya
Collaborator Author

aminya commented Mar 2, 2020

In my opinion, it is best for this package to be a "boring" backend package that just provides the methods from VML in a simple VML.function(::Vector) style. Packages that want to do fancy optimizations (like LoopVectorization) can implement whatever macros or tricks they want on top of it.

I agree. Maybe it is better to transfer this issue to LoopVectorization.

@Crown421
Collaborator

Crown421 commented Mar 2, 2020

There is the Vectorize.jl package, which used to combine all the vectorization libraries and pick the fastest one.
I think it could be a good candidate for such a macro.

As I mentioned in #22, I think IntelVectorMath has reached the limit of its scope. It is very nice and lean now, and does exactly what it says on the tin, fairly robustly.
Once a few small things have been sorted out, this could be tagged 1.0 and left there for a while.

@aminya
Collaborator Author

aminya commented Mar 2, 2020

I prefer LoopVectorization because it already has an @avx macro, and we can integrate IntelVectorMath into it easily.

Vectorize.jl hasn't been updated for a while, and my old issue is still open without any response:
rprechelt/Vectorize.jl#25

@Crown421
Collaborator

Crown421 commented Mar 2, 2020

Yes, Vectorize would need to be updated/revived completely. If you want to include AppleAccelerate as mentioned in the OP, I think that would be the way to go.

@chriselrod

chriselrod commented Mar 2, 2020

I've been planning on adding "loop splitting" support in LoopVectorization for a little while now (splitting one loop into several).
It would be possible to extend this to moving special functions into their own "loop" (a single vectorized call) and using VML (or some other library).

I would prefer "short vector" functions in general. They wouldn't require any changes to the library to support, nor any special casing. E.g., this works well with AVX2:

julia> using LinearAlgebra, LoopVectorization, BenchmarkTools

julia> U = randn(200, 220) |> x -> cholesky(Symmetric(x * x')).U;

julia> function triangle_logdet(A::Union{LowerTriangular,UpperTriangular})
           ld = zero(eltype(A))
           @avx for i in 1:size(A,1)
               ld += log(A[i,i])
           end
           ld
       end
triangle_logdet (generic function with 1 method)

julia> @btime logdet($U)
  2.131 μs (0 allocations: 0 bytes)
462.0132368439299

julia> @btime triangle_logdet($U)
  1.076 μs (0 allocations: 0 bytes)
462.0132368439296

julia> Float64(sum(log ∘ big, diag(U)))
462.0132368439296

Presumably, VML does not handle vectors with a stride other than 1, which would force me to copy the elements, log them, and then sum them if I wanted to use it there.
Assuming it's able to use some pre-allocated buffer...

julia> y3 = similar(diag(U));

julia> function triangle_logdet_vml!(y, A::Union{LowerTriangular, UpperTriangular})
           @avx for i ∈ 1:size(A,1)
               y[i] = A[i,i]
           end
           IntelVectorMath.log!(y, y)
           ld = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               ld += y[i]
           end
           ld
       end
triangle_logdet_vml! (generic function with 1 method)

julia> @btime triangle_logdet_vml!($y3, $U)
  697.691 ns (0 allocations: 0 bytes)
462.0132368439296

It looks like all that effort would pay off, so I'm open to it.
Long term I would still be in favor of implementing more of these special functions in Julia or LLVM, but this may be the better short term move. I also don't see many people jumping at the opportunity to implement SIMD versions of special functions (myself included).

Too bad VML isn't more expansive. Adding it wouldn't do much to increase the number of special functions currently supported by SLEEFPirates/LoopVectorization.
I've been wanting a digamma function, for example. I'll probably try the approach suggested by Wikipedia.
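
For reference, a scalar sketch of that approach as I read it (the recurrence ψ(x) = ψ(x+1) - 1/x plus the asymptotic series; digamma_sketch is a hypothetical name, not LoopVectorization code):

function digamma_sketch(x::Float64)
    ψ = 0.0
    # Shift x into the asymptotic regime via ψ(x) = ψ(x + 1) - 1/x.
    while x < 6.0
        ψ -= 1 / x
        x += 1.0
    end
    # Asymptotic expansion: ψ(x) ≈ log(x) - 1/(2x) - Σ B₂ₙ/(2n·x²ⁿ).
    t = 1 / (x * x)
    ψ += log(x) - 0.5 / x -
         t * (1/12 - t * (1/120 - t * (1/252 - t * (1/240))))
    return ψ
end

# digamma_sketch(1.0) ≈ -MathConstants.γ  (≈ -0.5772156649)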

How well does VML perform on AMD? Is that something I'd have to worry about?

EDIT:
With AVX512:

julia> using LinearAlgebra, LoopVectorization, IntelVectorMath, BenchmarkTools

julia> U = randn(200, 220) |> x -> cholesky(Symmetric(x * x')).U;

julia> function triangle_logdet(A::Union{LowerTriangular,UpperTriangular})
           ld = zero(eltype(A))
           @avx for i in 1:size(A,1)
               ld += log(A[i,i])
           end
           ld
       end
triangle_logdet (generic function with 1 method)

julia> @btime logdet($U)
  1.426 μs (0 allocations: 0 bytes)
463.5193875385334

julia> @btime triangle_logdet($U)
  234.677 ns (0 allocations: 0 bytes)
463.5193875385336

julia> Float64(sum(log ∘ big, diag(U)))
463.51938753853364

julia> y3 = similar(diag(U));

julia> function triangle_logdet_vml!(y, A::Union{LowerTriangular, UpperTriangular})
           @avx for i ∈ 1:size(A,1)
               y[i] = A[i,i]
           end
           IntelVectorMath.log!(y, y)
           ld = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               ld += y[i]
           end
           ld
       end
triangle_logdet_vml! (generic function with 1 method)

julia> @btime triangle_logdet_vml!($y3, $U)
  411.110 ns (0 allocations: 0 bytes)
463.51938753853364

With AVX512, it uses this log definition. I'd be more inclined to add something similar for AVX2. For this benchmark, the Intel compilers produce faster code.

@Crown421
Collaborator

Crown421 commented Mar 3, 2020

I will have access to an AMD processor on Friday, I will have a look then.
Also for the purposes of #32.

@aminya
Collaborator Author

aminya commented Mar 3, 2020

[Quotes @chriselrod's full comment above.]

Thank you for this detailed answer! @chriselrod

I just want to clarify what I mean in this issue, so everyone is on the same page.

We can consider three kinds of syntax for the macro (I use @ivm to avoid confusion):

  1. A simple macro that only searches the given Expr for the functions that IntelVectorMath provides and adds IVM. before their names (see the sketch after this list):
a = rand(100)
@ivm sin.(a) .* cos.(a) .* sum.(a)

should be translated to:

IVM.sin(a) .* IVM.cos(a) .* sum.(a)
  2. A macro that converts broadcasts into IVM calls (which I think is more in line with your example):
a = rand(100)
@ivm sin.(a) .* cos.(a)

which, similar to 1, is translated to:

IVM.sin(a) .* IVM.cos(a)

But in this case, the remaining functions can be handled with an @avx for loop:

a = rand(100)
@ivm sin.(a) .* cos.(a) .* sum.(a)

should be translated to:

out = Vector{eltype(a)}(undef, length(a))

temp = IVM.sin(a) .* IVM.cos(a)
@avx for i=1:length(a)
  out[i] = temp[i] * sum(a[i])
end
out
  3. Or, similar to (2) but probably more efficient: we can fuse the loops (the internal IntelVectorMath loop and the for loop) together and call IntelVectorMath on one element at a time:
out = Vector{eltype(a)}(undef, length(a))
@avx for i=1:length(a)
  out[i] = IVM.sin([a[i]])[1] * IVM.cos([a[i]])[1] * sum(a[i])
end
out
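
A rough sketch of option 1 (names like IVM_FUNCS and replace_ivm are made up for illustration, and this deliberately ignores the shadowing problem raised below):

using IntelVectorMath

const IVM_FUNCS = (:sin, :cos, :exp, :log)  # illustrative subset

replace_ivm(x) = x  # leave non-Expr leaves (symbols, literals) alone
function replace_ivm(ex::Expr)
    # `sin.(a)` parses as Expr(:., :sin, Expr(:tuple, :a))
    if ex.head === :. && length(ex.args) == 2 &&
       ex.args[1] isa Symbol && ex.args[1] in IVM_FUNCS &&
       ex.args[2] isa Expr && ex.args[2].head === :tuple
        f = ex.args[1]
        args = map(replace_ivm, ex.args[2].args)
        return :(IVM.$f($(args...)))  # sin.(a) -> IVM.sin(a)
    end
    return Expr(ex.head, map(replace_ivm, ex.args)...)
end

macro ivm(ex)
    esc(replace_ivm(ex))
end

# @ivm sin.(a) .* cos.(a) .* sum.(a)
# expands to IVM.sin(a) .* IVM.cos(a) .* sum.(a)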

So which syntax do we want to consider?

@KristofferC
Member

I think this issue can be closed on the basis that advanced macro rewrites of Julia code are likely out of the scope of this package.

@aminya
Collaborator Author

aminya commented Mar 3, 2020

I think this issue can be closed on the basis that advanced macro rewrites of Julia code are likely out of the scope of this package.

I would like to transfer it to LoopVectorization.jl. I don't have access to do that. Maybe @chriselrod can transfer it for me.

I think at least the 1st macro can be implemented in this package. It is just a find-and-replace macro.

@KristofferC
Member

I think at least the 1st macro can be implemented in this package. It is just a find-and-replace macro.

No, it isn't, really, because macros operate on syntax: you don't know whether someone has done using SomeOtherLibm: sin, in which case the sin symbol means something different from Base.sin. Let's keep this package unambiguous and simple.

@aminya
Collaborator Author

aminya commented Mar 3, 2020

I think at least the 1st macro can be implemented in this package. It is just a find-and-replace macro.

No, it isn't, really, because macros operate on syntax: you don't know whether someone has done using SomeOtherLibm: sin, in which case the sin symbol means something different from Base.sin. Let's keep this package unambiguous and simple.

When someone uses @ivm, that means they want to transform sin to IVM.sin.
For mixing libraries:

(@ivm sin.(a).*sin.(b)).*Base.sin.(a)

@KristofferC
Member

That is not a good idea, because the semantics of broadcasting are to fuse everything into a single kernel.
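
For context, roughly what fusion means here (hand-written equivalents):

a = rand(100)

# sin.(a) .* cos.(a) fuses into a single loop with no temporaries:
y = similar(a)
for i in eachindex(a)
    y[i] = sin(a[i]) * cos(a[i])
end

# Rewriting the calls to IVM.sin(a) .* IVM.cos(a) instead materializes
# two intermediate arrays and makes three passes over the data, so the
# macro would silently change the evaluation strategy.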

@aminya
Collaborator Author

aminya commented Mar 4, 2020

That is not a good idea, because the semantics of broadcasting are to fuse everything into a single kernel.

That's why I recommended the 3rd syntax. Actually, I am totally OK with moving this issue to LoopVectorization.

@KristofferC
Member

Ok, let's move it there then.

@aminya
Collaborator Author

aminya commented Mar 6, 2020

@chriselrod Could you transfer this issue to LoopVectorization? I don't have access.

@chriselrod

@aminya I think I'd need committer rights on IntelVectorMath to transfer an issue away from it.
Someone else can transfer it, or you could file a new issue and link this one.

@aminya
Collaborator Author

aminya commented Mar 7, 2020

I see. I will move it manually then.
