Add support for @simd #5355

Merged
merged 9 commits into master on Mar 31, 2014

@ArchRobison
Contributor

ArchRobison commented Jan 10, 2014

This pull request enables the LLVM loop vectorizer. It's not quite ready for production; I'd like feedback and help fixing some issues. The overall design is explained in this comment on issue #4786, except that it no longer relies on the "banana interface" mentioned in that comment.

Here is an example that it can vectorize when a is of type Float32, and x and y are of type Array{Float32,1}:

function saxpy( a, x, y )
    @simd for i=1:length(x)
        @inbounds y[i] = y[i]+a*x[i];
    end
end

I've seen the vectorized version run 3x faster than the unvectorized version when data fits in cache. When AVX can be enabled, the results are likely even better.

Programmers can put the @simd macro in front of one-dimensional for loops that have ranges of the form m:n, where the type of the loop index supports < and +. The decoration guarantees that the loop does not rely on wrap-around behavior and that the loop iterations are safe to execute in parallel, even if chunks are done in lockstep.
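
To make the guarantee concrete, here is a counter-example (an illustrative sketch, not code from this patch): a loop with a cross-iteration dependence that must not be decorated with @simd, because executing chunks of iterations in lockstep would read values before they are written:

function prefix_sum!( x )
    for i = 2:length(x)
        @inbounds x[i] = x[i] + x[i-1]   # iteration i reads the result of iteration i-1
    end
end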

The patch implements type-based alias analysis, which may help LLVM optimize better in general, and is essential for vectorization. The name "type-based alias analysis" is a bit of a misnomer, since it's really based on hierarchically partitioning memory. I've implemented it for Julia assuming that type-punning is never done for parts of data structures that users cannot access directly, but that user data can be type-punned freely.

Problems that I seek advice on:

  • The @simd macro is not found. Currently I have to do the following within the REPL:
include("base/simdloop.jl")
using SimdLoop.@simd

I tried to copy the way @printf is defined/exported, but something is wrong with my patch. What? (See the sketch after this list.)

  • LLVM 3.3 disallows attaching metadata to a block, so I've attached it to an instruction in the block. It's kind of ad-hoc, but seems to work. Is there a better way to do it?
  • An alternative to attaching metadata is to eliminate src/llvm-simdloop.cpp and instead rely on LLVM's auto-vectorization capability, which inserts memory dependence tests. That indeed does work for the saxpy example above, i.e. it vectorizes without the support of src/llvm-simdloop.cpp. However, @simd would still be necessary to transform the loop into a form such that LLVM can compute a trip count.
  • An alternative fix for the trip-count issue is to eliminate @simd altogether and instead somehow ensure that m:n is lowered to a form for which LLVM can compute a trip count.
  • I'm a neophyte at writing macros, so base/simdloop.jl could use a review by an expert.
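
Regarding the first bullet, a rough sketch (hypothetical wiring, not the actual patch) of the kind of setup I'm attempting: the macro lives in a module that exports it, and base/sysimg.jl pulls that export in after the include:

# base/simdloop.jl (sketch)
module SimdLoop

export @simd

macro simd(forloop)
    # placeholder body: mark the loop so it can be lowered specially later
    esc(forloop)
end

end # module

# base/sysimg.jl (sketch)
# include("simdloop.jl")
# importall .SimdLoop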

Apologies for the useless comment:

This file defines two entry points:

I just noticed it, but since it's late on a Friday, I'll fix it later. It's supposed to say that one entry point is for marking simd loops and the other is for later lowering marked loops.

Thanks to @simonster for his information on enabling the loop vectorizer. It was a big help to get me going.

@simonster
Member

simonster commented Jan 10, 2014

Amazing!

@jiahao
Member

jiahao commented Jan 10, 2014

😺

@johnmyleswhite
Member

johnmyleswhite commented Jan 10, 2014

💯

@JeffBezanson
Member

JeffBezanson commented Jan 11, 2014

Amazing, I look forward to reading this in detail. Even just the TBAA part is great to have.

@@ -175,6 +175,9 @@ using .I18n
using .Help
push!(I18n.CALLBACKS, Help.clear_cache)
+# SIMD loops
+include("simdloop.jl")

@nolta
Member

nolta commented Jan 12, 2014

I think you might need a

importall .SimdLoop

here?

@ArchRobison
Contributor

ArchRobison commented Jan 13, 2014

Thanks! Now added.

@lindahua
Member

lindahua commented Jan 15, 2014

Eagerly looking forward to this.

@ViralBShah
Member

ViralBShah commented Jan 16, 2014

Likewise. Waiting for this to land.

@ArchRobison
Contributor

ArchRobison commented Jan 16, 2014

One feature of the pull request is that it enables auto-vectorization of some loops without @simd. But it's quirky, and the underlying reason for the quirkiness needs discussion because with a small change, we might be able to enable wider use of auto-vectorization in Julia. Consider the following example:

function saxpy( a, x, y )
    for i in 1:length(x)
        @inbounds y[i] = y[i]+a*x[i];
    end
end

LLVM will not auto-vectorize it because it cannot compute a trip count. Now change 1:length(x) to (1:length(x))+0. Then (with the current PR) the example does vectorize!
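
For concreteness, the version that does vectorize under the current PR is identical except for the range:

function saxpy0( a, x, y )
    for i in (1:length(x))+0   # the "+0" defeats the special-case lowering of a:b
        @inbounds y[i] = y[i]+a*x[i];
    end
end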

The root issue is that the documented way that Julia lowers for loops works just fine for the vectorizer. But there is an undocumented optimization that gets in the way. If a loop has the form for i in a:b, then it is custom-lowered differently. (See 'for' in src/julia-syntax.scm.) The custom lowering likely helps compilation time by short-cutting through a lot of analysis and transformation. Regrettably it puts the loop in a form where LLVM cannot compute a trip count. Here's a sketch of the form (I'm abstracting out some details):

i = a
while i<=b 
    ...
    i = i+1

Assume a and b are of type Int. LLVM cannot compute a trip count because the loop is an infinite loop if b=typemax(Int). The "no signed wrap" flag (see #3929) would let LLVM rule out this possibility. So I think we should consider one of two changes to the short-cut lowering of for loops:

  • Somehow set the "no signed wrap" flag on the right add instruction, by using an intrinsic per the suggestion of @simonster.
  • Change the lowering to:
i = a
while i<b+1
    ...
    i = i+1

I think an annotation such as @simd is essential for trickier cases where run-time memory disambiguation is impractical. But I think we should consider whether the "short cut" lowering of for loops should be more friendly to auto-vectorization.

Comments?


@simonster
Member

simonster commented Jan 17, 2014

This seems kind of like a bug in the current lowering, since for i = typemax(Int):typemax(Int); end should probably not be an infinite loop. Changing the lowering to i < b+1 would cause a loop ending in typemax(Int) not to be executed at all, which is still not quite right (although if the current behavior is acceptable, this seems equally acceptable). If we care about handling loops ending in typemax(Int), it seems like we could lower to:

if b >= a
  i = a
  while i != b+1
      ...
      i = i+1
  end
end

Can LLVM compute a trip count in that case?


@JeffBezanson
Member

JeffBezanson commented Jan 17, 2014

Wow, it is quite satisfying that the shortcut hack works worse than the general case :)
This is indeed a bug.

It looks to me like @simonster's solution is the only one that will handle the full range. However, the Range1 type used in the general case can only have up to typemax(Int) elements. The special-case lowering could mimic that:

n = b-a+1
# error if range too big
c = 0
while c < n
    i = c+a
    ...
    c = c+1
end
@StefanKarpinski
Member

StefanKarpinski commented Jan 17, 2014

If you move the check to the end of the loop, then the fact that it's typemax doesn't matter:

i = a - 1
goto check
while true
    # body
    label check
    i < b || break
    i += 1
end

Edit: fix starting value.


@StefanKarpinski
Member

StefanKarpinski commented Jan 17, 2014

If you're willing to have an additional branch, then you can avoid the subtraction at the beginning.
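
For example (a sketch of one possible reading, not necessarily the exact lowering Stefan has in mind): guard the loop with the extra branch and increment only after the exit test, so the index never overflows even when b is typemax(Int):

if a <= b
    i = a
    while true
        # body
        i == b && break
        i += 1
    end
end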


@ArchRobison
Contributor

ArchRobison commented Jan 17, 2014

Is the short-cut expected to be semantically equivalent to the long path? E.g., how finicky should we be about what operations the types of the bounds are expected to support? If I understand correctly, the lowering at this point is happening before type inference. Do we have any measurements on what the short-cut is buying in terms of JIT+execution time or code space? I'm wondering if perhaps the short-cut could be removed and whatever savings it provided could be made up somewhere else in the compilation chain.

Here are some tricky examples to consider in proposing shortcuts/semantics:

for i=0.0:.1:.25   # Fun with floating-point round.  Tripcount should be 3.
       println(i)
end
for j=typemin(Int):typemin(Int)+1   # Tripcount should be 2.
       println(j)
end
for k=typemax(Int)-1:typemax(Int) # Tripcount should be 2
       println(k)
end

All of these deliver the correct (or at least obvious :-)) results with the long path, but may go astray with some shortcut solutions.

Besides user expectations, something else to consider is the path through the rest of the compilation chain. I suspect that the loop optimizations will almost invariably transform a test-at-top loop into a test-at-bottom loop wrapped in a zero-trip guard, i.e. something like this:

if (loop-test) {
      loop-preheader (compute loop invariants, initialize induction variables)
      do {
          loop body
      } while(loop-test);
}

So if we lower a loop into this form in the first place for semantic reasons, we're probably not creating any extra code bloat since the compiler was going to do it anyway.


@StefanKarpinski
Member

StefanKarpinski commented Jan 17, 2014

Maybe we should remove the special-case handling altogether? At this point, with range objects being immutable types and the compiler being quite smart about them, I suspect the special case may no longer be necessary. It originally was very necessary because neither of those things was true.


@simonster
Member

simonster commented Jan 17, 2014

Without special lowering, we have to make a function call to colon, which has to call the Range1 constructor. This appears to have noticeable overhead if the time to execute the loop is short. Consider:

function f(A)
    c = 0.0
    for i = 1:10000000
        for j = 1:length(A)
            @inbounds c += A[j]
        end
    end
    c
end

function g(A)
    c = 0.0
    for i = 1:10000000
        rg = 1:length(A)
        for j = rg
            @inbounds c += A[j]
        end
    end
    c
end

The only difference here should be that f(A) gets the special lowering whereas g(A) does not. For A = rand(5), after compilation, f(A) is consistently almost twice as fast:

julia> @time f(A);
elapsed time: 0.03747795 seconds (64 bytes allocated)

julia> @time f(A);
elapsed time: 0.037112331 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066732369 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066190191 seconds (64 bytes allocated)

If A = rand(100), the difference is almost non-existent, but I don't think we should deoptimize small loops. OTOH, if we could fully inline colon and the optimizer can elide the non-negative length check for Range1 construction, maybe this would generate the same code as @JeffBezanson's proposal.


@JeffBezanson
Member

JeffBezanson commented Jan 17, 2014

Getting rid of the special case would be great. I'll explore what extra inlining might get us here.

@JeffBezanson
Member

JeffBezanson commented Jan 17, 2014

LLVM seems to generate far more compact code with these definitions:

start(r::Range1) = r.start
next{T}(r::Range1{T}, i) = (i, oftype(T, i+1))
done(r::Range1, i) = i==(r.start+r.len)

With that plus full inlining I think we will be ok without the special case. Just need to make sure it can still vectorize the result.


@JeffBezanson
Member

JeffBezanson commented Jan 17, 2014

Another idea: use the Range1 type only for integers, and have it store start and stop instead of length. That way the start and stop values can simply be accepted with no checks, and the length method can throw an overflow error if the length can't be represented as an Int. The reason for this is that computing the length is the hard part, and you often don't need it.

Otherwise we are faced with the following:

  1. Check stop<start, set length to 0 if so
  2. Compute checked_add(checked_sub(stop,start),1) to check for over-long ranges
  3. Call Range1 constructor, which must check length<0 in case somebody calls the constructor directly

So there are 3 layers of checks, the third of which is redundant when called from colon. We could have a hidden unsafe constructor that elides check (3), for use by colon, but that's kind of a hack and only addresses a small piece.

More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write start:step:stop with integers, you're better off keeping those same three numbers since you get more range.
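
A minimal sketch of that idea (hypothetical type name; assumes the checked arithmetic mentioned above): the endpoints are stored unchecked, and only length pays for the overflow check:

immutable IntRange{T<:Integer}
    start::T
    stop::T                 # accepted as-is, no checks at construction
end

function Base.length(r::IntRange)
    r.stop < r.start && return 0                    # empty range
    checked_add(checked_sub(r.stop, r.start), 1)    # throws if the length cannot be represented
end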


@ArchRobison
Contributor

ArchRobison commented Jan 17, 2014

I verified that the auto-vectorizer can vectorize this example, which I believe is equivalent to the code after "full inlining" of @JeffBezanson's changes to Range1.

function saxpy( a, x, y )
    r = 1:length(x)
    s = r.start
    while !(s==(r.start+r.len))
        (i,s) = (s,oftype(Int,s+1))
        @inbounds y[i] = y[i]+a*x[i];
    end
end
@ArchRobison
Contributor

ArchRobison commented Jan 17, 2014

By the way, it's probably good to limit the shortcut to integer loops, or at least avoid any schemes that rely on floating-point induction variables. Otherwise round-off can cause surprises. Here's a surprise with the current Julia:

a=2.0^53
b=a+2
r = a:b
for i in r        # Performs 3 iterations as expected
    println(i)
end
for i in a:b      # Infinite loop
    println(i)
end
@JeffBezanson
Member

JeffBezanson commented Jan 17, 2014

Clearly we need to just remove the special case. That will be a great change.

JeffBezanson added a commit that referenced this pull request Jan 17, 2014

remove special-case lowering for `for i = a:b` loops (ref #5355)
this fixes some edge-case loops that the special lowering did not
handle correctly.

colon() now checks for overflow in computing the length, which avoids
some buggy Range1s that used to be possible.

this required some changes to make sure Range1 is fast enough:
specialized start, done, next, and a hack to avoid one of the checks and
allow better inlining.

in general performance is about the same, but a few cases are actually
faster, since Range1 is now faster (comprehensions used Range1 instead
of the special-case lowering, for example). also, more loops should be
vectorizable when the appropriate LLVM passes are enabled. all that
plus better correctness and a simpler front-end, and I'm sold.
@StefanKarpinski
Member

StefanKarpinski commented Jan 18, 2014

> More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write start:step:stop with integers, you're better off keeping those same three numbers since you get more range.

This seems quite sensible. I believe this actually addresses things like ranges of Char, BigInt, and other non-traditional types that you might want ranges of. There was another example recently, which I don't recall.


@ArchRobison
Contributor

ArchRobison commented Jan 22, 2014

Where in the manual should I document @simd? It's fundamentally about relaxing control flow, so doc/manual/control-flow.rst is a logical place. However, @simd is a bit esoteric and might be a distraction there. It's different from the parallel programming model, so doc/manual/parallel-computing.rst doesn't seem like the right place. Should I give @simd its own section in the manual?


@ivarne
Contributor

ivarne commented Jan 22, 2014

I would expect to find something like @inbounds and @simd in a performance chapter. They are both about making the user do something that ideally would be the compiler's job.

How about performance-tips.rst?


@jiahao
Member

jiahao commented Jan 22, 2014

I like the idea of a new "performance tweaks" chapter

@simonster
Member

simonster commented Jan 22, 2014

If we're still planning to implement #2299, I suspect we'll eventually need a whole chapter just for SIMD.


@tknopp
Contributor

tknopp commented Jan 22, 2014

@simonster Hopefully not. LLVM's autovectorizer is pretty good, and I have doubts that hand-written SIMD code is always faster. In my experience, writing a simple matrix-vector multiplication in C with autovectorization is as fast as the SIMD-optimized Eigen routines (I was using gcc when I tested this).


@lindahua
Member

lindahua commented Jan 22, 2014

I agree that when this lands, #2299 might be less urgent than before. Still, there are plenty of cases where explicit use of SIMD instructions is desired.

The latest advances in compiler technology have made compilers more intelligent, and they are now able to detect and vectorize simple loops (e.g. mapping and simple reductions, or sometimes matrix-multiplication patterns).

However, they are still not smart enough to automatically vectorize more complex computations: for example, image filtering, small matrix algebra (where an entire matrix can fit in a small number of AVX registers, and one can finish an 8x8 matrix multiplication in less than 100 CPU cycles using carefully crafted SIMD code), as well as transcendental functions, etc.


@lindahua
Member

lindahua commented Jan 22, 2014

Here is Intel's example of using AVX for 8x8 matrix multiplication, which can be accomplished in about 100 cycles:

http://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices

There are plenty of SIMD tricks, such as broadcasting, shuffling, unpacking, etc. I haven't seen a compiler that is smart enough to turn a C for loop into such code.


@tknopp
Contributor

tknopp commented Jan 22, 2014

Yes, there are cases where the compiler is not smart enough. But hand-written SIMD is a serious maintenance burden, and then the compiler guys catch up and one is on par again. It's great when we have support for hand-written SIMD instructions, but this is expert stuff for people like you, not for "regular" users who go through the manual.


@lindahua
Member

lindahua commented Jan 22, 2014

Sure, that's why I said it is not as urgent. Obviously, the auto-vectorization stuff needs to land sooner.

@ArchRobison
Contributor

ArchRobison commented Jan 22, 2014

For documenting @simd, I'll pursue the suggestion of adding to performance-tips.rst. It makes sense, at least for now, to document @inbounds and @simd there since they are both guarantees from the programmer about certain program properties that can enable performance improvements. Furthermore, the current implementation of @simd requires use of @inbounds for effective vectorization, though this requirement is one that I hope to eliminate in the future since, in principle, @simd's permissive evaluation order should allow hoisting bounds checks or vectorizing them.


@ArchRobison
Contributor

ArchRobison commented Jan 27, 2014

I rebased and somehow got some of Jeff's unrelated commits mixed into my pull request. How do I remove them?

@pao
Member

pao commented Jan 27, 2014

There's a merge at the top of your history which probably has something to do with it.

You can make a backup of the branch (git branch my-backup-branch) then try to remove them with interactive rebase (git rebase -i master). Delete the lines for the commits you want to discard. You may need to delete d8cea4c.


@ivarne
Contributor

ivarne commented Jan 27, 2014

The backup branch is not needed. If you know how to create and delete branches, git reflog gives you the sha for the commit before you screwed up, so that you can reset and try again.

You can also discard commits without changing your working directory with git reset master and start adding and committing your work.


@ArchRobison
Contributor

ArchRobison commented Jan 28, 2014

I got rid of Jeff's commits but picked up other recent commits. How does GitHub decide which commits are part of my pull request? Isn't it just the diff between the master on my fork and the branch adr/simdloop on my fork?


@StefanKarpinski
Member

StefanKarpinski commented Jan 28, 2014

It's any commits that are on your branch but not the branch you're looking to merge into – i.e. master.

@pao
Member

pao commented Jan 28, 2014

> The backup branch is not needed. If you know how to create and delete branches, git reflog gives you the sha for the commit before you screwed up, so that you can reset and try again.

You can do this, but you're relying on git's GC not to kick in, see that those commits aren't currently connected to the DAG, and remove them. The branch is free and easily deleted afterwards. (@ivarne)


@ViralBShah
Member

ViralBShah commented Apr 25, 2014

One other question. This is perhaps not the right place, but I am trying to see if @simd will be able to speed up sparse matvec. It doesn't seem to help. Would it be possible to further characterize the cases where one may expect a speedup?

function simdmatvec(A::SparseMatrixCSC, x::AbstractVector, y::AbstractVector)
    nzv = A.nzval
    rv = A.rowval
    @inbounds for col = 1 : A.n
        xcol = x[col]
        k1 = A.colptr[col]
        k2 = A.colptr[col+1]-1
        @simd for k = k1:k2
            y[rv[k]] += nzv[k]*xcol
        end
    end
    y
end

Incidentally, here the @inbounds is outside the @simd block.


@JeffBezanson
Member

JeffBezanson commented Apr 25, 2014

You'll be happy to know that sparse multiply of all kinds already got much faster with this patch, without explicit @simd. For me:

Before:

sparsemul            86.268   88.148   87.075    0.837
sparsemul2           49.249   64.618   55.687    7.234
sparserange          44.261   67.173   55.907   11.141
matvec                9.048    9.085    9.071    0.014

After:

sparsemul            48.300   50.560   49.398    0.965
sparsemul2           47.177   60.579   52.202    6.758
sparserange          40.913   63.947   52.374   11.493
matvec                6.078    6.126    6.107    0.020
@tkelman
Contributor

tkelman commented Apr 25, 2014

30% better on matvec is great! Interesting that the big improvement on sparse matmul happens for the dense matrix of ones represented in CSC form, where successive indices are adjacent. For the more randomized sparsity pattern in sparsemul2 it's only about 6% better. Harder to take advantage of SIMD with truly random access, but I'm starting to see why PETSc and other libraries go to the trouble of implementing block-sparse formats.


@ArchRobison
Contributor

ArchRobison commented Apr 25, 2014

#5355 brought in type-based alias analysis, which I suspect is what helps the sparse matrix code, because it enables much better hoisting and common sub-expression elimination.

My understanding of the current state of the LLVM vectorizer is that @simd can only pay off in the following circumstances:

  1. The loop must be an innermost loop.
  2. The loop body must be straight-line code. That's why @inbounds is currently needed.
  3. Accesses must have a stride pattern. I.e., no "gathers" (random-index reads) or "scatters" (random-index writes).
  4. The stride should be unit stride. The vectorizer can deal with non-unit strides, but it uses scalar loads/stores in this case, which is likely to dominate the execution time.
  5. The number of arrays accessed must be too large for LLVM's auto-vectorization (which #5355 also brought in) to kick in on its own. In fact, all that @simd does is tell the compiler "don't worry about cross-iteration dependencies". With only 2 or 3 arrays accessed by a loop, the auto-vectorization usually kicks in.
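
As an illustration of point 4 (a hypothetical example, not from this thread): copying a row of a column-major matrix gives a unit-step loop but non-unit-stride memory accesses, so little or no speedup should be expected even with @simd:

function copyrow!( dest, A, i )
    @simd for j = 1:size(A,2)
        @inbounds dest[j] = A[i,j]   # consecutive j's are size(A,1) elements apart in memory
    end
    dest
end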

Speaking of Intel's instruction sets, AVX2 adds a gather instruction, though so far LLVM doesn't seem to know how to use it.


@ViralBShah
Member

ViralBShah commented Apr 25, 2014

Thanks. I will update this in the docs. I was hoping that LLVM would be able to leverage the gather instruction for access patterns such as this one. That was why I was trying the matvec example. This has been very informative.

@mlubin
Member

mlubin commented May 7, 2014

Perhaps a documentation issue: does SIMD work on a standard build or do we need to use a special version of LLVM? If so, what's required in Make.user?

@ArchRobison
Contributor

ArchRobison commented May 8, 2014

It should work with a standard build.

@ArchRobison
Contributor

ArchRobison commented Sep 15, 2014

I've posted a long article on @simd, similar to my JuliaCon talk.


jiahao added a commit to JuliaLang/julialang.github.com that referenced this pull request Sep 15, 2014

@jakebolewski
Member

jakebolewski commented Sep 15, 2014

@ArchRobison that article was fantastic!

@vchuravy
Member

vchuravy commented Jun 16, 2015

Recently there has been work on enabling interleaved memory accesses [1] in LLVM. I am wondering how best to use this in combination with the SIMD work.

[1] http://reviews.llvm.org/rL239291


@ArchRobison
Contributor

ArchRobison commented Jun 16, 2015

I see the feature is off by default. Maybe we could enable it with -O? My initial take is that the poster child for vectorizing interleaved memory access is complex arithmetic, but typically that involves complex multiplications, which will require more work in LLVM to vectorize.
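
For instance (a hypothetical sketch, not something measured here), a complex axpy reads and writes interleaved real/imaginary pairs, which is exactly the access pattern that work targets, though vectorizing it well also requires LLVM to handle the complex multiply:

function caxpy!( a::Complex128, x::Vector{Complex128}, y::Vector{Complex128} )
    @inbounds for i = 1:length(x)
        y[i] += a*x[i]    # each element is an interleaved (re, im) pair in memory
    end
    y
end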


@jackmott

jackmott commented Jan 19, 2016

I would like to add a vote for some method of doing SIMD by hand, whether it be part of a standard library or a language feature. Probably 90% of the potential benefit of SIMD is not going to be realized with automatic vectorization, and compilers aren't going to bridge that gap significantly ever. Consider, for example, the implementation of common noise functions like Perlin noise. These involve dozens of steps, a few branches, and lookup tables, things the compilers won't be figuring out in my lifetime. My hand-written SIMD achieved a 3-5x speedup (128- vs 256-bit-wide varieties) over what the latest compilers manage to do automatically, and I am a complete novice. There is a whole universe of applications, including games, image processing, video streaming, video editing, physics and number theory research, where programmers are forced to drop down to C or accept code that is 3x-10x slower than it needs to be. With 512-bit-wide SIMD coming into the market it is too powerful to ignore, and adding good support for SIMD immediately differentiates your language from the other new languages out there, which mostly ignore SIMD.


@iamed2
Contributor

iamed2 commented Jan 19, 2016

@jackmott You may be able to manually vectorize using llvmcall, but that would require knowledge of LLVM IR

@eschnett
Contributor

eschnett commented Jan 19, 2016

I've been wanting to write a small library based on NTuple and llvmcall for some time...

@JeffBezanson
Member

JeffBezanson commented Jan 19, 2016

That would be awesome. Would be great to have simd types and operations within easy reach.

@JeffBezanson
Member

JeffBezanson commented Jan 19, 2016

We could reopen #2299

@eschnett
Contributor

eschnett commented Jan 20, 2016

Here we go:

julia> workspace(); using SIMD; code_native(sqrt, (Vec{4,Float64},))
    .section    __TEXT,__text,regular,pure_instructions
Filename: SIMD.jl
Source line: 0
    pushq   %rbp
    movq    %rsp, %rbp
Source line: 186
    vsqrtpd (%rsi), %ymm0
    vextractf128    $1, %ymm0, %xmm1
Source line: 5
    vmovhpd %xmm1, 24(%rdi)
    vmovlpd %xmm1, 16(%rdi)
    vmovhpd %xmm0, 8(%rdi)
    vmovlpd %xmm0, (%rdi)
    movq    %rdi, %rax
    popq    %rbp
    vzeroupper
    retq

This is with Julia master, using LLVM 3.7.1. LLVM seems to be a bit confused about how to store an array to memory, leading to the ugly vmov sequence in the end, but the actual vectorization works like a charm. See https://github.com/eschnett/SIMD.jl for the proof of concept.


@vchuravy
Member

vchuravy commented Jan 20, 2016

@eschnett I assume I am too quick, but SIMD.jl is still empty ;)


@eschnett
Contributor

eschnett commented Jan 20, 2016

Thank you, forgot to push after adding the code.

@eschnett
Contributor

eschnett commented Jan 21, 2016

@JeffBezanson I notice that Julia tuples are mapped to LLVM arrays, not LLVM vectors. To generate SIMD instructions, one has to convert via a series of extractvalue and insertelement instructions. Unfortunately, it turns out that LLVM (3.7, x86-64) is not good at optimizing these, on certain occasions leading to cumbersome generated code that breaks vectors into scalars and re-assembles them.

Is there a chance to represent tuples as LLVM vectors instead?

I'm currently representing SIMD types as a bitstype in Julia, since these can be efficiently bitcast to LLVM vector types. That leads to efficient code, but is more complex on the Julia side.


@ArchRobison
Contributor

ArchRobison commented Jan 21, 2016

I'm on sabbatical (four more days!) and largely ignoring email and GitHub. But apropos of this issue, I have an extant LLVM patch that fixes the "cumbersome code" problem that Erik observed. The patch was developed after I discovered from experience that mapping tuples to LLVM vectors was not going to work well.


Contributor

ArchRobison commented Jan 21, 2016

I'm on sabbatical (four more days!) and largely ignoring email and Github.
But apropos to this issue, I have an extant LLVM patch that fixes the
"cumbersome code" problem that Erik observed. The patch was developed
after I discovered from experience that mapping tuples to LLVM vectors was
not going to work well.

On Thu, Jan 21, 2016 at 8:25 AM, Erik Schnetter notifications@github.com
wrote:

@JeffBezanson https://github.com/JeffBezanson I notice that Julia
tuples are mapped to LLVM arrays, not LLVM vectors. To generate SIMD
instructions, one has to convert in a series of extractvalue and
insertelement instructions. Unfortunately, it turns out that LLVM (3.7,
x86-64) is not good at optimizing these, leading at certain occasions to
cumbersome generated code that breaks vectors into scalars and re-assembles
them.

Is there a chance to represent tuples as LLVM vectors instead?

I'm currently representing SIMD types as bitstype in Julia, since these
can be efficiently bitcast to LLVM vector types. That leads to efficient
code, but is more complex on the Julia side.


Reply to this email directly or view it on GitHub
#5355 (comment).

@yuyichao

yuyichao commented Jan 21, 2016

Member

It would be nice if we had a standardized type for LLVM vectors, since they might be necessary to (c)call some vector math libraries.

@eschnett

eschnett commented Jan 21, 2016

Contributor

@ArchRobison I'm looking forward to trying your patch.

For the record, this is how a simple loop (summing an array of Float64) currently looks:

L224:
    vmovq   %rdx, %xmm0
    vmovq   %rbx, %xmm1
    vunpcklpd   %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0],xmm1[0]
    vmovq   %rdi, %xmm1
    vmovq   %rsi, %xmm2
    vunpcklpd   %xmm2, %xmm1, %xmm1 ## xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm1, %ymm0, %ymm0
    vaddpd  (%rcx), %ymm0, %ymm0
    vextractf128    $1, %ymm0, %xmm1
    vpextrq $1, %xmm1, %rsi
    vmovq   %xmm1, %rdi
    vpextrq $1, %xmm0, %rbx
    vmovq   %xmm0, %rdx
    addq    $32, %rcx
    addq    $-4, %rax
    jne L224

Only the add instruction does real work; the move, extract, unpack, and insert instructions are strictly redundant.

@JeffBezanson

JeffBezanson commented Jan 21, 2016

Member

I recall some problems in mapping tuples to vectors, very likely involving alignment, calling convention, and/or bugs in LLVM. It's clear that only a small subset of tuple types can potentially be vector types, so there's ambiguity about whether a given tuple will be a struct or vector or array, which can cause subtle bugs interoperating with native code.

@ArchRobison

ArchRobison commented Jan 22, 2016

Contributor

I had the mapping from tuples to vectors working, with all the fixes for alignment. It was messily context sensitive. But that wasn't the show-stopper. What killed it was that it hurt performance as often as it helped. My conclusion was that the mapping to vectors needs to happen much later in the compilation pipeline, when LLVM can be sure it will likely pay off. On Monday, when I return to the office after my 9-week absence, I'll track down the review status of my LLVM patch. (Its context is probably bit-rotted by now.)

– Arch

@eschnett

eschnett commented Jan 28, 2016

Contributor

@ArchRobison Did you have time to look for the patch?

@ArchRobison

ArchRobison commented Jan 28, 2016

Contributor

Yes, and I updated it this morning per suggestions from LLVM reviewers while I was out. The patch has two parts:

http://reviews.llvm.org/D14185
http://reviews.llvm.org/D14260

@eschnett

eschnett commented Jan 28, 2016

Contributor

@ArchRobison I'm currently generating SIMD code like this:

julia> @code_llvm Vec{2,Float64}(1) + Vec{2,Float64}(2)

define void @"julia_+_23864.1"(%Vec.12* sret, %Vec.12*, %Vec.12*) #0 {
top:
  %3 = getelementptr inbounds %Vec.12, %Vec.12* %1, i64 0, i32 0
  %4 = load [2 x double], [2 x double]* %3, align 8
  %5 = getelementptr inbounds %Vec.12, %Vec.12* %2, i64 0, i32 0
  %6 = load [2 x double], [2 x double]* %5, align 8
  %arg1arr_0.i = extractvalue [2 x double] %4, 0
  %arg1_0.i = insertelement <2 x double> undef, double %arg1arr_0.i, i32 0
  %arg1arr_1.i = extractvalue [2 x double] %4, 1
  %arg1.i = insertelement <2 x double> %arg1_0.i, double %arg1arr_1.i, i32 1
  %arg2arr_0.i = extractvalue [2 x double] %6, 0
  %arg2_0.i = insertelement <2 x double> undef, double %arg2arr_0.i, i32 0
  %arg2arr_1.i = extractvalue [2 x double] %6, 1
  %arg2.i = insertelement <2 x double> %arg2_0.i, double %arg2arr_1.i, i32 1
  %res.i = fadd <2 x double> %arg1.i, %arg2.i
  %res_0.i = extractelement <2 x double> %res.i, i32 0
  %resarr_0.i = insertvalue [2 x double] undef, double %res_0.i, 0
  %res_1.i = extractelement <2 x double> %res.i, i32 1
  %resarr.i = insertvalue [2 x double] %resarr_0.i, double %res_1.i, 1
  %7 = getelementptr inbounds %Vec.12, %Vec.12* %0, i64 0, i32 0
  store [2 x double] %resarr.i, [2 x double]* %7, align 8
  ret void
}

That is:

  • a sequence of extractvalue/insertelement to convert the Julia tuple/LLVM array to an LLVM vector
  • a single LLVM vector operation (here: add)
  • a sequence of extractelement/insertvalue to convert the LLVM vector back to an LLVM array/Julia tuple

With your patches, would this still be a good way to proceed? Or should this be a sequence of scalar operations instead, omitting the insert-/extractelement statements?

@ArchRobison

ArchRobison commented Jan 28, 2016

Contributor

Yes and no. The patch http://reviews.llvm.org/D14260 deals with optimizing the store. I ran your example through (using `%Vec.12 = type { [2 x double] }`), and the store was indeed optimized to:

  %res.i = fadd <2 x double> %arg1.i, %arg2.i
  %7 = bitcast %Vec.12* %0 to <2 x double>*
  store <2 x double> %res.i, <2 x double>* %7, align 8
  ret void

But the load sequence was not optimized. The problem is that http://reviews.llvm.org/D14185 is targeting the situation where the tuple code is still fully scalar LLVM IR (such as this example from the unit tests), not partially vectorized code as in your example. For what you are doing, is it practical to generate fully scalar LLVM IR? Or do we need to consider adding another instruction-combining transform to LLVM?

@eschnett

eschnett commented Jan 28, 2016

Contributor

Yes, emitting scalar operations would be straightforward to do.

In the past -- with much older versions of LLVM, and/or with GCC -- it was important to emit arithmetic operations as vector operations, since they would otherwise not be synthesized. Newer versions of LLVM seem to be much better at this, so scalar operations might be the way to go.
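
For illustration, the scalar-ops variant might look roughly like this (a sketch only; `Vec` and `vadd` here are stand-ins, not SIMD.jl's actual definitions):

```julia
# Hypothetical tuple-backed vector type whose arithmetic is emitted as plain
# scalar operations, leaving re-vectorization to LLVM (with the patches above).
immutable Vec{N,T}    # `struct` in later Julia versions
    elts::NTuple{N,T}
end

# Element-wise addition as N scalar adds over the tuple.
vadd{N,T}(a::Vec{N,T}, b::Vec{N,T}) = Vec{N,T}(map(+, a.elts, b.elts))
```

With D14260 the store of such a tuple can become a single vector store; whether the scalar arithmetic itself gets re-vectorized is what D14185 targets.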

@eschnett

eschnett commented Jan 29, 2016

Contributor

Yay! Success!

@ArchRobison Your patch D14260, applied to LLVM 3.7.1, with Julia's master branch and my LLVM-vector version of SIMD.jl, is generating proper SIMD vector instructions without the nonsensical scalarization.

Here are two examples of generated AVX2 code (with bounds checking disabled; keeping it enabled still vectorizes the code, but has two additional branches at every loop iteration):

Adding two arrays:

L176:
    movq    (%r15), %rdx
Source line: 766
    vmovupd (%rcx,%rdx), %ymm0
Source line: 458
    movq    (%rbx), %rsi
Source line: 419
    vaddpd  (%rcx,%rsi), %ymm0, %ymm0
Source line: 803
    vmovupd %ymm0, (%rcx,%rdx)
    movq    %r14, -64(%rbp)
Source line: 62
    addq    $32, %rcx
    addq    $-4, %rax
    jne L176

Calculating the sum of an array:

L128:
    vaddpd  (%rcx), %ymm0, %ymm0
Source line: 55
    addq    $32, %rcx
    addq    $-4, %rax
    jne L128

Accessing the array elements in the first kernel is still too complicated. I assume that LLVM needs to be told that the two arrays don't overlap each other or the array descriptors. Also, some loop unrolling is called for.

Thanks a million!
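
For reference, kernels of roughly this shape (written against SIMD.jl's `Vec`, `vload`, and `vstore`) are what produce loops like the ones above. This is a sketch only; SIMD.jl's exact constructors, reductions, and index conventions may differ:

```julia
using SIMD  # https://github.com/eschnett/SIMD.jl

# Sketch: add two arrays four Float64 at a time. Assumes the length is a
# multiple of 4 and that vload/vstore index directly into the array.
function vadd!(x::Vector{Float64}, y::Vector{Float64})
    @inbounds for i in 1:4:length(x)
        xv = vload(Vec{4,Float64}, x, i)
        yv = vload(Vec{4,Float64}, y, i)
        vstore(xv + yv, x, i)
    end
    return x
end

# Sketch: sum an array with a vector accumulator, reduced to a scalar at the end.
function vsum(x::Vector{Float64})
    acc = Vec{4,Float64}(0.0)
    @inbounds for i in 1:4:length(x)
        acc += vload(Vec{4,Float64}, x, i)
    end
    return sum(acc)
end
```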


@ArchRobison

ArchRobison commented Jan 29, 2016

Contributor

Good to hear it worked. Was that just D14260, or D14260 and D14185 (http://reviews.llvm.org/D14185)? (Logically the two diffs belong together, but LLVM review formalities caused the split.)

@eschnett

eschnett commented Jan 29, 2016

Contributor

This was only D14260. D14185 didn't apply, so I tried without it, and it worked.

eschnett added a commit to eschnett/julia that referenced this pull request Feb 7, 2016

Add LLVM patch D14260 to improve SIMD code
Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvmcall` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>.

Some discussion relevant to this PR is in #5355. @ArchRobison, please comment.

Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>.

Without this patch, the loop kernel looks like (x86-64, AVX2 instructions):

```
    vunpcklpd   %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0]
    vunpcklpd   %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm3, %ymm1, %ymm1
    vmovupd 8(%rcx), %xmm2
    vinsertf128 $1, 24(%rcx), %ymm2, %ymm2
    vaddpd  %ymm2, %ymm1, %ymm1
    vpermilpd   $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
    vextractf128    $1, %ymm1, %xmm3
    vpermilpd   $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0]
Source line: 62
    vaddsd  (%rcx), %xmm0, %xmm0
```

Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning.

With this patch, the loop kernel looks like:

```
L192:
	vaddpd	(%rdx), %ymm1, %ymm1
Source line: 62
	addq	%rsi, %rdx
	addq	%rcx, %rdi
	jne	L192
```

which is perfect.


eschnett added a commit to eschnett/julia that referenced this pull request Feb 8, 2016

Add LLVM patch D14260 to improve SIMD code