
Add support for @simd #5355

Merged
merged 9 commits
@ArchRobison
Collaborator

This pull request enables the LLVM loop vectorizer. It's not quite ready for production. I'd like feedback and help fixing some issues. The overall design is explained in this comment to issue #4786, except that it no longer relies on the "banana interface" mentioned in that comment.

Here is an example that it can vectorize when a is of type Float32, and x and y are of type Array{Float32,1}:

function saxpy( a, x, y )
    @simd for i=1:length(x)
        @inbounds y[i] = y[i]+a*x[i];
    end
end

I've seen the vectorized version run 3x faster than the unvectorized version when data fits in cache. When AVX can be enabled, the results are likely even better.

Programmers can put the @simd macro in front of one-dimensional for loops that have ranges of the form m:n, where the type of the loop index supports < and +. The annotation guarantees that the loop does not rely on wrap-around behavior and that the loop iterations are safe to execute in parallel, even if chunks are done in lockstep.

The patch implements type-based alias analysis, which may help LLVM optimize better in general, and is essential for vectorization. The name "type-based alias analysis" is a bit of a misnomer, since it's really based on hierarchically partitioning memory. I've implemented it for Julia assuming that type-punning is never done for parts of data structures that users cannot access directly, but that user data can be type-punned freely.
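
For example, user-level punning like the following must remain legal under this scheme (an illustrative snippet of mine, not from the patch):

a = Float64[1.0, 2.0]
b = reinterpret(Int64, a)   # b shares a's memory
b[1] = 0                    # now a[1] == 0.0; TBAA must not assume otherwise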

Problems that I seek advice on:

  • The @simd macro is not found. Currently I have to do the following within the REPL:
include("base/simdloop.jl")
using SimdLoop.@simd

I tried to copy the way @printf is defined/exported, but something is wrong with my patch. What?

  • LLVM 3.3 disallows attaching metadata to a block, so I've attached it to an instruction in the block. It's kind of ad-hoc, but seems to work. Is there a better way to do it?
  • An alternative to attaching metadata is to eliminate src/llvm-simdloop.cpp and instead rely on LLVM's auto-vectorization capability, which inserts memory dependence tests. That indeed does work for the saxpy example above, i.e. it vectorizes without the support of src/llvm-simdloop.cpp. However, @simd would still be necessary to transform the loop into a form such that LLVM can compute a trip count.
  • An alternative solution to the trip-count issue is to eliminate @simd altogether and instead somehow ensure that m:n is lowered to a form for which LLVM can compute a trip count.
  • I'm a neophyte at writing macros, so base/simdloop.jl could use a review by an expert.

Apologies for the useless comment:

This file defines two entry points:

I just noticed it, but being late on a Friday, I'll fix it later. It's supposed to say that one entry point is for marking simd loops and the other is for later lowering marked loops.

Thanks to @simonster for his information on enabling the loop vectorizer. It was a big help to get me going.

@simonster
Collaborator

Amazing!

@jiahao
Collaborator

:smiley_cat:

@johnmyleswhite
Collaborator

:100:

@JeffBezanson

Amazing, I look forward to reading this in detail. Even just the TBAA part is great to have.

@nolta nolta commented on the diff
base/sysimg.jl
@@ -175,6 +175,9 @@ using .I18n
using .Help
push!(I18n.CALLBACKS, Help.clear_cache)
+# SIMD loops
+include("simdloop.jl")
@nolta Collaborator
nolta added a note

I think you might need a

importall .SimdLoop

here?

@ArchRobison Collaborator

Thanks! Now added.

@lindahua
Collaborator

Eagerly looking forward to this.

@ViralBShah
Owner

Likewise. Waiting for this to land.

@ArchRobison
Collaborator

One feature of the pull request is that it enables auto-vectorization of some loops without @simd. But it's quirky, and the underlying reason for the quirkiness needs discussion because with a small change, we might be able to enable wider use of auto-vectorization in Julia. Consider the following example:

function saxpy( a, x, y )
    for i in 1:length(x)
        @inbounds y[i] = y[i]+a*x[i];
    end
end

LLVM will not auto-vectorize it because it cannot compute a trip count. Now change 1:length(x) to (1:length(x))+0, as shown below. Then (with the current PR) the example does vectorize!
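
That is, the version that vectorizes is:

function saxpy( a, x, y )
    for i in (1:length(x))+0   # the +0 defeats the special-case lowering
        @inbounds y[i] = y[i]+a*x[i];
    end
end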

The root issue is that the documented way that Julia lowers for loops works just fine for the vectorizer. But there is an undocumented optimization that gets in the way. If a loop has the form for i in a:b, then it is custom-lowered differently. (See 'for' in src/julia-syntax.scm.) The custom lowering likely helps compilation time by short-cutting a lot of analysis and transformation. Regrettably it puts the loop in a form where LLVM cannot compute a trip count. Here's a sketch of the form (I'm abstracting out some details):

i = a
while i<=b 
    ...
    i = i+1

Assume a and b are of type Int. LLVM cannot compute a trip count because the loop is an infinite loop if b=typemax(Int). The "no signed wrap" flag (see #3929) would enable LLVM to rule out this possibility. So I think we should consider one of two changes to the short-cut lowering of for loops:

  • Somehow set the "no signed wrap" flag on the right add instruction, by using an intrinsic per the suggestion of @simonster.
  • Change the lowering to:
i = a
while i<b+1
    ...
    i = i+1

I think an annotation such as @simd is essential to trickier cases where run-time memory disambiguation is impractical. But I think we should consider whether the "short cut" lowering of for loops should be more friendly to auto-vectorization.

Comments?

@simonster
Collaborator

This seems kind of like a bug in the current lowering, since for i = typemax(Int):typemax(Int); end should probably not be an infinite loop. Changing the lowering to i < b+1 would cause a loop ending in typemax(Int) not to be executed at all, which is still not quite right (although if the current behavior is acceptable, this seems equally acceptable). If we care about handling loops ending in typemax(Int), it seems like we could lower to:

if b >= a
  i = a
  while i != b+1
      ...
      i = i+1
  end
end

Can LLVM compute a trip count in that case?

@JeffBezanson

Wow, it is quite satisfying that the shortcut hack works worse than the general case :)
This is indeed a bug.

It looks to me like @simonster's solution is the only one that will handle the full range. However, the Range1 type used in the general case can only have up to typemax(Int) elements. The special-case lowering could mimic that:

n = b-a+1
# error if range too big
c = 0
while c < n
    i = c+a
    ...
    c = c+1
end
@StefanKarpinski

If you move the check to the end of the loop, then the fact that it's typemax doesn't matter:

i = a - 1
goto check
while true
    # body
    label check
    i < b || break
    i += 1
end

Edit: fix starting value.

@StefanKarpinski

If you're willing to have an additional branch, then you can avoid the subtraction at the beginning.

@ArchRobison
Collaborator

Is the short-cut expected to be semantically equivalent to the long path? E.g., how finicky should we be about which signatures are expected for the types of the bounds? If I understand correctly, the lowering at this point is happening before type inference. Do we have any measurements on what the short-cut is buying in terms of JIT+execution time or code space? I'm wondering if perhaps the short-cut could be removed and whatever savings it provided could be made up somewhere else in the compilation chain.

Here are some tricky examples to consider in proposing shortcuts/semantics:

for i=0.0:.1:.25   # Fun with floating-point rounding.  Tripcount should be 3.
       println(i)
end
for j=typemin(Int):typemin(Int)+1   # Tripcount should be 2.
       println(j)
end
for k=typemax(Int)-1:typemax(Int) # Tripcount should be 2
       println(k)
end

All of these deliver the correct (or at least obvious :-)) results with the long path, but may go astray with some shortcut solutions.

Besides user expectations, something else to consider is the path through the rest of the compilation chain. I suspect that the loop optimizations will almost invariably transform a test-at-top loop into a test-at-bottom loop wrapped in a zero-trip guard, i.e. something like this:

if (loop-test) {
      loop-preheader (compute loop invariants, initialize induction variables)
      do {
          loop body
      } while(loop-test);
}

So if we lower a loop into this form in the first place for semantic reasons, we're probably not creating any extra code bloat since the compiler was going to do it anyway.

@StefanKarpinski

Maybe we should remove the special case handling altogether? At this point, with range objects being immutable types and the compiler being quite smart about them, I suspect the special case may no longer be necessary. It originally was very necessary because neither of those things was true.

@simonster
Collaborator

Without special lowering, we have to make a function call to colon, which has to call the Range1 constructor. This appears to have noticeable overhead if the time to execute the loop is short. Let:

function f(A)
    c = 0.0
    for i = 1:10000000
        for j = 1:length(A)
            @inbounds c += A[j]
        end
    end
    c
end

function g(A)
    c = 0.0
    for i = 1:10000000
        rg = 1:length(A)
        for j = rg
            @inbounds c += A[j]
        end
    end
    c
end

The only difference here should be that f(A) gets the special lowering whereas g(A) does not. For A = rand(5), after compilation, f(A) is consistently almost twice as fast:

julia> @time f(A);
elapsed time: 0.03747795 seconds (64 bytes allocated)

julia> @time f(A);
elapsed time: 0.037112331 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066732369 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066190191 seconds (64 bytes allocated)

If A = rand(100), the difference is almost non-existent, but I don't think we should deoptimize small loops. OTOH, if we could fully inline colon and the optimizer can elide the non-negative length check for Range1 construction, maybe this would generate the same code as @JeffBezanson's proposal.

@JeffBezanson

Getting rid of the special case would be great. I'll explore what extra inlining might get us here.

@JeffBezanson

LLVM seems to generate far more compact code with these definitions:

start(r::Range1) = r.start
next{T}(r::Range1{T}, i) = (i, oftype(T, i+1))
done(r::Range1, i) = i==(r.start+r.len)

With that plus full inlining I think we will be ok without the special case. Just need to make sure it can still vectorize the result.

@JeffBezanson

Another idea: use the Range1 type only for integers, and have it store start and stop instead of length. That way the start and stop values can simply be accepted with no checks, and the length method can throw an overflow error if the length can't be represented as an Int. The reason for this is that computing the length is the hard part, and you often don't need it.

Otherwise we are faced with the following:

  1. Check stop<start, set length to 0 if so
  2. Compute checked_add(checked_sub(stop,start),1) to check for over-long ranges
  3. Call Range1 constructor, which must check length<0 in case somebody calls the constructor directly

So there are 3 layers of checks, the third of which is redundant when called from colon. We could have a hidden unsafe constructor that elides check (3), for use by colon, but that's kind of a hack and only addresses a small piece.

More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write start:step:stop with integers, you're better off keeping those same three numbers since you get more range.
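
A minimal sketch of the storage idea (IntRange1 is an illustrative name, not part of this PR, and the checked helpers are assumed from Base's checked arithmetic):

# Hypothetical sketch: integer-only range storing start and stop,
# deferring the overflow check to length().
immutable IntRange1{T<:Integer}
    start::T
    stop::T     # accepted with no checks; the range is empty if stop < start
end

function length(r::IntRange1)
    r.stop < r.start && return 0
    # Throws an overflow error for over-long ranges such as typemin(Int):typemax(Int).
    checked_add(checked_sub(r.stop, r.start), 1)
end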

@ArchRobison
Collaborator

I verified that the auto-vectorizer can vectorize this example, which I believe is equivalent to the code after "full inlining" of @JeffBezanson's changes to Range1.

function saxpy( a, x, y )
    r = 1:length(x)
    s = r.start
    while !(s==(r.start+r.len))
        (i,s) = (s,oftype(Int,s+1))
        @inbounds y[i] = y[i]+a*x[i];
    end
end
@ArchRobison
Collaborator

By the way, it's probably good to limit the shortcut to integer loops, or at least avoid any schemes that rely on floating-point induction variables. Otherwise round-off can cause surprises. Here's a surprise with the current Julia:

a=2.0^53
b=a+2
r = a:b
for i in r        # Performs 3 iterations as expected
    println(i)
end
for i in a:b      # Infinite loop
    println(i)
end
@JeffBezanson

Clearly we need to just remove the special case. That will be a great change.

@JeffBezanson JeffBezanson referenced this pull request from a commit
@JeffBezanson JeffBezanson remove special-case lowering for `for i = a:b` loops (ref #5355)
this fixes some edge-case loops that the special lowering did not
handle correctly.

colon() now checks for overflow in computing the length, which avoids
some buggy Range1s that used to be possible.

this required some changes to make sure Range1 is fast enough:
specialized start, done, next, and a hack to avoid one of the checks and
allow better inlining.

in general performance is about the same, but a few cases are actually
faster, since Range1 is now faster (comprehensions used Range1 instead
of the special-case lowering, for example). also, more loops should be
vectorizable when the appropriate LLVM passes are enabled. all that
plus better correctness and a simpler front-end, and I'm sold.
0860767
@StefanKarpinski

More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write start:step:stop with integers, you're better off keeping those same three numbers since you get more range.

This seems quite sensible. I believe this actually addresses things like ranges of Char, BigInt, and other non-traditional types that you might want ranges of. There was another example recently, which I don't recall.

@ArchRobison
Collaborator

Where in the manual should I document @simd? It's fundamentally about relaxing control flow, so doc/manual/control-flow.rst is a logical place. However, @simd is a bit esoteric and might be a distraction there. It's different from the parallel programming model, so doc/manual/parallel-computing.rst doesn't seem like the right place. Should I give @simd its own section in the manual?

@ivarne
Collaborator

I would expect to find something like @inbounds and @simd in a performance chapter. They are both about making the user do something that ideally would be the compiler's job.

How about performance-tips.rst?

@jiahao
Collaborator

I like the idea of a new "performance tweaks" chapter

@simonster
Collaborator

If we're still planning to implement #2299, I suspect we'll eventually need a whole chapter just for SIMD.

@tknopp
Collaborator

@simonster Hopefully not. The autovectorizer of LLVM is pretty good, and I have doubts that hand-written SIMD code is always faster. In my experience, writing a simple matrix-vector multiplication in C with autovectorization is as fast as the SIMD-optimized Eigen routines (I was using gcc when I tested this).

@lindahua
Collaborator

I agree that when this lands, #2299 might be less urgent than before. Still, there are plenty of cases where explicit use of SIMD instructions is desired.

Recent advances in compiler technology have made compilers more intelligent, and they are now able to detect & vectorize simple loops (e.g. mapping and simple reduction, or sometimes matrix multiplication patterns).

However, they are still not smart enough to automatically vectorize more complex computation: for example, image filtering, small matrix algebra (where an entire matrix can fit in a small number of AVX registers, and one can finish an 8x8 matrix multiplication in less than 100 CPU cycles using carefully crafted SIMD code), as well as transcendental functions, etc.

@lindahua
Collaborator

Here is Intel's example of using AVX for 8x8 matrix multiplication, which can be accomplished in about 100 cycles:

http://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices

There are plenty of SIMD tricks, such as broadcasting, shuffling, unpacking, etc. I haven't seen a compiler that is smart enough to turn a C for loop into such codes.

@tknopp
Collaborator

Yes, there are cases where the compiler is not smart enough. But hand-written SIMD is a serious maintenance burden. And then the compiler guys catch up and one is on par again. It's great when we have support for hand-written SIMD instructions. But this is expert stuff for people like you, not for "regular" users who go through the manual.

@lindahua
Collaborator

Sure, that's why I said it is not as urgent. Obviously, the auto-vectorization stuff needs to land sooner.

@ArchRobison
Collaborator

For documenting @simd, I'll pursue the suggestion of adding to performance-tips.rst. It makes sense, at least for now, to document @inbounds and @simd there, since they are both guarantees from the programmer about certain program properties that can enable performance improvements. Furthermore, the current implementation of @simd requires use of @inbounds for effective vectorization, though this is a requirement I hope to eliminate in the future since, in principle, @simd's permissive evaluation order should allow hoisting bounds checks or vectorizing them.

@ArchRobison
Collaborator

I rebased and somehow got some of Jeff's unrelated commits mixed into my pull request. How do I remove them?

@pao
Collaborator

There's a merge at the top of your history which probably has something to do with it.

You can make a backup of the branch (git branch my-backup-branch) then try to remove them with interactive rebase (git rebase -i master). Delete the lines for the commits you want to discard. You may need to delete d8cea4c.

@ivarne
Collaborator

The backup branch is not needed. If you know how to create and delete branches, git reflog gives you the sha for the commit before you screwed up, so that you can reset and try again.

You can also discard commits without changing your working directory with git reset master and start adding and committing your work.

@ArchRobison
Collaborator

I got rid of Jeff's commits but picked up other recent commits. How does Github decide which commits are part of my pull request? Isn't it just the diff between the master on my fork and the branch adr/simdloop on my fork?

@StefanKarpinski

It's any commits that are on your branch but not the branch you're looking to merge into – i.e. master.

@pao
Collaborator

The backup branch is not needed. If you know how to create and delete branches, git reflog gives you the sha for the commit before you screwed up, so that you can reset and try again.

You can do this, but you're relying on git's GC not to kick in, notice that those commits aren't currently connected to the DAG, and remove them. The branch is free and easily deleted afterwards. (@ivarne)

@ArchRobison
Collaborator

Thanks to all for the advice about git. I think I have my pull request in shape for a serious review. The pull request includes documentation and a light correctness test. It does not check that vectorization actually happens since that's platform dependent. Maybe there's an existing performance test that could incorporate @simd.

@ArchRobison
Collaborator

Just a reminder/clarification that this pull request is ready for review.

@StefanKarpinski

I went ahead and removed the "WIP" from the title.

@jakebolewski
Collaborator

@ArchRobison I was trying to test this out today with VTune. How did you get the julia source to show up with the assembler output?

@ArchRobison
Collaborator

Alas, code_native seems to have a bug that makes it fib a lot. I've mostly been using code_llvm to verify successful vectorization (and looking at times). If you are using VTune Amplifier, configure USE_INTEL_JITEVENTS = 1 in Make.inc when you build Julia. Then Amplifier's assembly view can be used to show the code.

Be sure to compare against a Julia build without the pull request as a baseline. I was puzzled when one example sped up just a little with @simd versus without, using the same build of Julia. Eventually I figured out that the pull request was enabling automatic vectorization to kick in for the example without @simd.

@ViralBShah
Owner

@JeffBezanson Is this something that can make it into the 0.3 release?

@ArchRobison ArchRobison referenced this pull request
Open

compiler optimization tracker #3440

33 of 64 tasks complete
@ArchRobison
Collaborator

Rebased.

@StefanKarpinski

I think we should merge this but be 100% clear that it's experimental and that any code that uses the @simd macro might break in the future if we change this.

@ArchRobison
Collaborator

I concur with making it experimental. Field experience could lead to substantial changes.

@ivarne
Collaborator

This should be mentioned in NEWS.md

@ViralBShah
Owner

How about calling it @simd_expt, so that when it is no longer experimental, it will have a different name?

@ArchRobison
Collaborator

That would be fine with me.

@lindahua
Collaborator

I don't think @simd_expt is necessary. One can easily remove the macro when it causes problems. We just need to clearly document and encourage users to be cautious.

@johnmyleswhite
Collaborator

I agree with Dahua: if you tell people the macro is experimental, you don't need to give it a weird name. For people who want to use it, changing the name later seems unnecessary.

@simonster
Collaborator

Perhaps it can be called @simd but unexported?

@tknopp
Collaborator

What's the risk of @simd being exported? It's similar to @inbounds, which is also not actively documented.

@ivarne
Collaborator

@tknopp Always quote macro names with backticks, so that you do not ping the GitHub user simd. He has not been active on public GitHub for a while, but he might still not appreciate the email.

The main concern is that we might want to use the @simd name with other semantics, and requiring users to type @Base.simd or @simd_expt is an obvious sign that you might get bug reports saying that something changed in new versions of Julia.

@ArchRobison
Collaborator

For loops that operate on just a few vectors and do not involve reductions, the PR often gives you automatic loop vectorization because LLVM is able to insert run-time dependence tests. So some codes will pick up benefit even without using @simd.

@tknopp
Collaborator

@ArchRobison: Just out of interest, did you make any benchmarks of this PR compared to master? Does the autovectorization cover similar loops as the gcc vectorizer?

@ArchRobison
Collaborator

I'm not familiar with gcc's auto-vectorization capabilities. Here are some micro-benchmarks that I've been using. All speedups noted are for operating on Float32, running on a 4th generation Intel Core processor (formerly known as Haswell).

  • saxpy calculation: Try it with and without @simd. Either way, the PR makes it run about 4x faster than the Julia master.
  • 2D seismic wave FDTD kernel: It requires building Julia with LLVM 3.4 to demonstrate speedup (about 1.9x for me) over master.
  • inner product of two vectors: This example will not auto-vectorize, even with the PR, because it needs the re-association permission granted by @simd. Improvement with @simd is about 4x over master. (A sketch of the kernel follows below.)
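
For concreteness, the inner-product kernel is essentially of this shape (a sketch, not the exact benchmark code):

function inner( x, y )
    s = zero(eltype(x))
    # The accumulation into s is a cross-iteration dependence; vectorizing
    # it needs the re-association permission that @simd grants.
    @simd for i = 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end
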
@StefanKarpinski

I say we call this @simd, export it and document it as experimental.

@tknopp
Collaborator

@ArchRobison Impressive! It's so great to see that several individuals are making improvements to Julia in various different areas, which in sum make such an awesome environment!

The question regarding the gcc autovectorizer is because it is used in -O3 builds by default. And when comparing C vs. Julia, I find it quite crucial to know whether the difference in speed is due to usage of SIMD instructions or due to some other limitations. This PR seems to close one further gap to gcc/C usage.

(Probably OpenMP is the last gap to close... but this is another issue)

src/ccall.cpp
@@ -368,7 +368,8 @@ static Value *julia_to_native(Type *ty, jl_value_t *jt, Value *jv,
return builder.CreateBitCast(emit_arrayptr(jv), ty);
}
if (aty == (jl_value_t*)jl_ascii_string_type || aty == (jl_value_t*)jl_utf8_string_type) {
- return builder.CreateBitCast(emit_arrayptr(emit_nthptr(jv,1)), ty);
+ // FIXME - is tbaa_arrayptr correct here?
+ return builder.CreateBitCast(emit_arrayptr(emit_nthptr(jv,1,tbaa_arrayptr)), ty);
@JeffBezanson Owner

I think it should be tbaa_array perhaps? This emit_nthptr is basically doing *(jl_array_t**)addr.

@ArchRobison Collaborator

tbaa_array is the right annotation for a load from a field of a jl_array_t when the particular field is not known. From Jeff's description, it sounds like the load in question is something else, but I'm not sure what. Looking at jl_array_to_string for guidance, it looks to me as if the load here is from an element of a jl_value_t*[2] built expressly for strings. What other loads/stores can get at those pointers?

@JeffBezanson Owner

I see. Yes, it is loading from a jl_value_t**. The pointer is constant, since the string types are immutable.

@ArchRobison Collaborator

I changed tbaa_arrayptr to tbaa_const. LLVM's documentation on TBAA's support for constant memory is not rigorous about defining what constant memory really is. The examples seem less constant than a ROM. I'm taking it to mean that a memory location is constant by the time LLVM can see it. In this case, LLVM can't get at the location until jl_array_to_string allocates it (and assigns to it), so I think I'm on safe ground.

src/cgutils.cpp
@@ -1271,18 +1276,18 @@ static Value *emit_arraysize(Value *t, jl_value_t *ex, int dim, jl_codectx_t *ct
{
jl_arrayvar_t *av = arrayvar_for(ex, ctx);
if (av != NULL && dim <= (int)av->sizes.size())
- return builder.CreateLoad(av->sizes[dim-1]);
+ return tbaa_decorate(tbaa_arraysize, builder.CreateLoad(av->sizes[dim-1]));
return emit_arraysize(t, dim);
@simonster Collaborator

I do not understand all of how this works and there may be a good reason for this, but this variation on emit_arraysize, which gets called on vectors, does not decorate with tbaa_arraysize, which seems to prevent broadcast on vectors from getting vectorized.

@JeffBezanson Owner

Vectors are still somewhat pessimized because they can grow, and so their size and data pointer can change in ways the code generator is not yet able to predict.

@simonster Collaborator

What confuses me is that there's a difference between how size(x, 1) and length(x) are optimized for vectors, even though it seems like they should be subject to the same caveats. If I have:

julia> f(x) = for i = 1:size(x, 1); @inbounds x[i] += length(x); end

julia> g(x) = for i = 1:size(x, 1); @inbounds x[i] += size(x, 1); end

then LLVM vectorizes f(Vector{Float32}) but not g(Vector{Float32}).

@JeffBezanson Owner

Interesting; that might not be correct. The tbaa_arraylen in emit_arraylen_prim might need to be removed.

@ArchRobison Collaborator

I missed the point that I was dealing with a jl_arrayvar_t, not a jl_array_t. Looking at the routine maybe_alloc_arrayvar, it appears that jl_arrayvar_t::sizes describes a bunch of stack-local variables. They are definitely not related to the tbaa_array... metadata. If their addresses are not captured in a way that confuses LLVM, then no metadata is necessary. If they are captured in a way that confuses LLVM, then some new metadata nodes would be appropriate.

@ArchRobison
Collaborator

Thanks for the code review. I think the latest revision addresses the comments. Simon's example with g(x) now vectorizes.

@ArchRobison
Collaborator

I documented it as experimental (in bold :-). How can we reach closure on the feature?

@JeffBezanson

This is looking pretty great. I'll probably merge it very soon.

@JeffBezanson

@loladiro It'd be good for you to read this over if you get a chance.

@ArchRobison
Collaborator

Rebased. Some target-machine related logic was removed from the commits for src/codegen.cpp, since it has been superseded in the Julia master.

@JeffBezanson JeffBezanson merged commit 2abcadd
@Keno
Owner

Awesome. I did read through this yesterday and it looked great to me. Exciting to have this.

@StefanKarpinski

Yes, this is great.

@timholy
Collaborator

Wow, now it's time for me to learn how to use this. Thanks!

@timholy
Collaborator

Hmm. Quite easy to use :)

This could use a NEWS.md entry.

@StefanKarpinski

Isn't the usage basically:

  1. annotate with @simd
  2. get magic speedup

?

@jiahao
Collaborator

:cocktail:

This is the second piece of super exciting news I've gotten today

@timholy
Collaborator

Yes, that's what I discovered and meant by "quite easy to use" :). I was assuming it supported writing SIMD expressions, but this is more immediately useful to a wide audience.

@ArchRobison
Collaborator

Thanks to all who reviewed the code or offered encouraging comments.

To clarify the usage models, there are two:
A> Do nothing. For simple loops, auto-vectorization often kicks in if applicable.
B> Use @simd:
0. Check that the loop is amenable to vectorization, per the restrictions.
1. Annotate with @simd
2. Often get magic speedup
Step 0 cannot be overemphasized; otherwise wrong results may ensue.

Usage Model A shouldn't be forgotten either. Just because @simd does not speed up a loop does not mean the loop didn't vectorize. Maybe it just vectorized anyway.

For now, vectorization either way will surely not happen without @inbounds.

@GunnarFarneback

My favorite testsuite ended up 130% slower after adding @simd in a critical loop. Hopefully this is #6343 related. (The results come out as expected, and the generated code shows some addps and mulps after adding @simd.)

@cbecker

@GunnarFarneback it may be worth giving the code of the #6343 branch a try; I wonder if that would yield the same results (100%) or, hopefully, faster ones.

@lencioni lencioni referenced this pull request in diffux/diffux
Closed

SnapshotComparisonImage::Overlayed is sloooooooooow #76

@ArchRobison
Collaborator

@GunnarFarneback : Did #6343 fix the testsuite slowdown?

@GunnarFarneback

Mostly but not completely. It went from 130% slowdown to about 10% slowdown.

@ArchRobison
Collaborator

@GunnarFarneback Can I get a copy of the testsuite to see what's going on?

@GunnarFarneback

The testsuite contains a lot of proprietary code but I'll try to extract a minimal testcase.

@ArchRobison
Collaborator

Compiler writers appreciate minimal. :smile:

@ViralBShah
Owner

@ArchRobison Would it be possible to enable @inbounds by default for all array accesses, when inside @simd scope?

@ArchRobison
Collaborator

@ViralBShah I've wondered about that myself. What's bothered me is conflating a performance knob with a no-safety-belt knob. That could give a bad rep to the performance knob. More precisely:

  • Currently, @simd is essentially pointless without @inbounds. So currently, making @simd imply @inbounds makes sense.
  • But in the future, I think we can eventually improve compilation of @simd so that it generates SIMD code even with bounds checking. Though the code might not be quite as good as code with @inbounds, particularly if sparse accesses (scatter/gather) are involved, since then the checks cannot be hoisted, but instead must be vectorized. So @inbounds would be a separate choice to make.

Perhaps an intermediate step would be to make @simd issue a performance warning if it encounters a subscript and @inbounds is off. Is there a way that a macro can check its scope for @inbounds? Alternatively, and perhaps a more comprehensive solution, would be to recognize hopelessly unvectorizable cases in the LowerSIMDLoop pass. I believe the check is easy to add there -- any basic block without successors will thwart the vectorizer. If I put the warning there, is it possible to issue a warning that will refer back to the offending Julia source line?

@ivarne
Collaborator

Why would the macro need to check its scope for @inbounds? Can't it walk the expression tree and check that no array references are visible when you exclude expressions under an @inbounds macrocall? Something along the lines of the sketch below.
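
(An untested sketch of that walk; has_unguarded_ref is an illustrative name:)

# Return true if the expression contains an array reference that is
# not underneath an @inbounds macrocall.
function has_unguarded_ref(x)
    isa(x, Expr) || return false
    x.head == :macrocall && x.args[1] == symbol("@inbounds") && return false
    x.head == :ref && return true
    any(has_unguarded_ref, x.args)
end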

@ArchRobison
Collaborator

I was thinking of the general case where @inbounds might be outside the @simd. But yes, if @inbounds is inside @simd, it would be easy for @simd to check.

@ViralBShah
Owner

One other question. This is perhaps not the right place, but I am trying to see if @simd will be able to speed up sparse matvec. It doesn't seem to help. Would it be possible to further characterize the cases where one may expect a speedup?

function simdmatvec(A::SparseMatrixCSC, x::AbstractVector, y::AbstractVector)
    nzv = A.nzval
    rv = A.rowval
    @inbounds for col = 1 : A.n
        xcol = x[col]
        k1 = A.colptr[col]
        k2 = A.colptr[col+1]-1
        @simd for k = k1:k2
            y[rv[k]] += nzv[k]*xcol
        end
    end
    y
end

Incidentally, here the @inbounds is outside the @simd block.

@JeffBezanson

You'll be happy to know that sparse multiply of all kinds already got much faster with this patch, without explicit @simd. For me:

Before:

sparsemul            86.268   88.148   87.075    0.837
sparsemul2           49.249   64.618   55.687    7.234
sparserange          44.261   67.173   55.907   11.141
matvec                9.048    9.085    9.071    0.014

After:

sparsemul            48.300   50.560   49.398    0.965
sparsemul2           47.177   60.579   52.202    6.758
sparserange          40.913   63.947   52.374   11.493
matvec                6.078    6.126    6.107    0.020
@tkelman
Owner

30% better on matvec is great! Interesting that the big improvement on sparse matmul happens for the dense matrix of ones represented in CSC form, where successive indices are adjacent. For the more randomized sparsity pattern in sparsemul2 it's only about 6% better. Harder to take advantage of SIMD with truly random access, but I'm starting to see why PETSc and other libraries go to the trouble of implementing block-sparse formats.

@ArchRobison
Collaborator

#5355 brought in type-based alias analysis, which I suspect is what helps the sparse matrix code, because it enables much better hoisting and common sub-expression elimination.

My understanding of the current state of the LLVM vectorizer is that @simd can only pay off in the following circumstances:

  1. The loop must be an innermost loop.
  2. The loop body must be straight-line code. That's why @inbounds is currently needed.
  3. Accesses must have a stride pattern. I.e., no "gathers" (random-index reads) or "scatters" (random-index writes).
  4. The stride should be unit stride. The vectorizer can deal with non-unit strides, but it uses scalar loads/stores in this case, which is likely to dominate the execution time.
  5. The number of arrays accessed must be too large for LLVM's auto-vectorization (which #5355 also brought in) to kick in on its own. In fact, all that @simd does is tell the compiler "don't worry about cross-iteration dependencies". With only 2 or 3 arrays accessed by a loop, auto-vectorization usually kicks in.

Speaking of Intel's instruction sets, Intel's AVX2 adds a gather instruction, though so far LLVM doesn't seem to know how to use it. An example of restriction 3 follows below.
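
For illustration (a sketch, not one of the benchmarks above), this loop reads through random indices -- a gather -- so per restriction 3 it is not expected to vectorize even with @simd:

function gather_sum( x, idx )
    s = zero(eltype(x))
    @simd for i = 1:length(idx)
        @inbounds s += x[idx[i]]   # x[idx[i]] is a random-index read
    end
    s
end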

@ViralBShah
Owner

Thanks. I will update this in the docs. I was hoping that LLVM would be able to leverage the gather instruction for access patterns such as this one. That was why I was trying the matvec example. This has been very informative.

@mlubin
Collaborator

Perhaps a documentation issue: does SIMD work on a standard build or do we need to use a special version of LLVM? If so, what's required in Make.user?

@ArchRobison
Collaborator

It should work with a standard build.

@ArchRobison
Collaborator

I've posted a long article on @simd, similar to my JuliaCon talk.

@jiahao jiahao referenced this pull request from a commit in JuliaLang/julialang.github.com
@jiahao jiahao Add @ArchRobison's vectorization article 9e338bc
@jakebolewski
Collaborator

@ArchRobison that article was fantastic!

3  base/exports.jl
@@ -1319,4 +1319,5 @@ export
@sprintf,
@deprecate,
@boundscheck,
- @inbounds
+ @inbounds,
+ @simd
60 base/simdloop.jl
@@ -0,0 +1,60 @@
+# Support for @simd for
+
+module SimdLoop
+
+export @simd
+
+# Error thrown from ill-formed uses of @simd
+type SimdError <: Exception
+ msg::ASCIIString
+end
+
+# Parse iteration space expression
+# symbol '=' range
+# symbol 'in' range
+function parse_iteration_space( x )
+ if !Meta.isexpr(x,[:(=),:in])
+ throw( SimdError("= or in expected"))
+ elseif length(x.args)!=2
+ throw( SimdError("simd range syntax is wrong"))
+ elseif !isa(x.args[1],Symbol)
+ throw( SimdError("simd loop index must be a symbol"))
+ else
+ x.args # symbol, range
+ end
+end
+
+# Compile Expr x in context of @simd.
+function compile(x)
+ if !Meta.isexpr(x,:for)
+ throw(SimdError("for loop expected"))
+ elseif length(x.args)!=2
+ throw(SimdError("1D for loop expected"))
+ else
+ var,range = parse_iteration_space(x.args[1])
+ r = gensym() # Range
+ n = gensym() # Trip count
+ s = gensym() # Step
+ i = gensym() # Index variable
+ # LLVM vectorizer needs to compute a trip count, so make it obvious.
+ quote
+ local $r = $range
+ local $n = length($r)
+ local $s = step($r)
+ local $var = first($r)
+ local $i = zero($n)
+ while $i < $n
+ $(x.args[2])
+ $var += $s
+ $i += 1
+ $(Expr(:simdloop)) # Mark loop as SIMD loop
+ end
+ end
+ end
+end
+
+macro simd(forloop)
+ esc(compile(forloop))
+end
+
+end # simdloop
4 base/sysimg.jl
@@ -185,6 +185,10 @@ using .I18n
using .Help
push!(I18n.CALLBACKS, Help.clear_cache)
+# SIMD loops
+include("simdloop.jl")
+importall .SimdLoop
+
# sparse matrices and linear algebra
include("sparse.jl")
importall .SparseMatrix
43 doc/manual/performance-tips.rst
@@ -449,6 +449,49 @@ These are some minor points that might help in tight inner loops.
- Use ``div(x,y)`` for truncating division of integers instead of
``trunc(x/y)``, and ``fld(x,y)`` instead of ``floor(x/y)``.
+Performance Annotations
+-----------------------
+
+Sometimes you can enable better optimization by promising certain program
+properties.
+
+- Use ``@inbounds`` to eliminate array bounds checking within expressions.
+ Be certain before doing this. If the subscripts are ever out of bounds,
+ you may suffer crashes or silent corruption.
+- Write ``@simd`` in front of ``for`` loops that are amenable to vectorization.
+ **This feature is experimental** and could change or disappear in future
+ versions of Julia.
+
+Here is an example with both forms of markup::
+
+ function tightloop( x, y, z )
+ s = zero(eltype(z))
+ n = min(length(x),length(y),length(z))
+ @simd for i in 1:n
+ @inbounds begin
+ z[i] = x[i]-y[i]
+ s += z[i]*z[i]
+ end
+ end
+ s
+ end
+
+The range for a ``@simd for`` loop should be a one-dimensional range.
+A variable used for accumulating, such as ``s`` in the example, is called
+a *reduction variable*. By using ``@simd``, you are asserting several
+properties of the loop:
+
+- It is safe to execute iterations in arbitrary or overlapping order,
+ with special consideration for reduction variables.
+- Floating-point operations on reduction variables can be reordered,
+ possibly causing different results than without ``@simd``.
+- No iteration ever waits on another iteration to make forward progress.
+
+Using ``@simd`` merely gives the compiler license to vectorize. Whether
+it actually does so depends on the compiler. The current implementation
+will not vectorize if there are possible early exits from the loop, such
+as from array bounds checking. This limitation may be lifted in the future.
+
Tools
-----
2  src/Makefile
@@ -8,7 +8,7 @@ override CPPFLAGS += $(JCPPFLAGS)
SRCS = \
jltypes gf ast builtins module codegen interpreter \
- alloc dlload sys init task array dump toplevel jl_uv jlapi profile
+ alloc dlload sys init task array dump toplevel jl_uv jlapi profile llvm-simdloop
FLAGS = \
-D_GNU_SOURCE \
1  src/Windows.mk
@@ -31,6 +31,7 @@ OBJECTS = \
toplevel.obj \
jl_uv.obj \
jlapi.obj \
+ llvm-simdloop.obj \
gc.obj
LIBFLISP = flisp\libflisp.lib
1  src/alloc.c
@@ -87,6 +87,7 @@ jl_sym_t *compositetype_sym; jl_sym_t *type_goto_sym;
jl_sym_t *global_sym; jl_sym_t *tuple_sym;
jl_sym_t *dot_sym; jl_sym_t *newvar_sym;
jl_sym_t *boundscheck_sym; jl_sym_t *copyast_sym;
+jl_sym_t *simdloop_sym;
typedef struct {
int64_t a;
2  src/ccall.cpp
@@ -368,7 +368,7 @@ static Value *julia_to_native(Type *ty, jl_value_t *jt, Value *jv,
return builder.CreateBitCast(emit_arrayptr(jv), ty);
}
if (aty == (jl_value_t*)jl_ascii_string_type || aty == (jl_value_t*)jl_utf8_string_type) {
- return builder.CreateBitCast(emit_arrayptr(emit_nthptr(jv,1)), ty);
+ return builder.CreateBitCast(emit_arrayptr(emit_nthptr(jv,1,tbaa_const)), ty);
}
if (jl_is_structtype(aty) && jl_is_leaf_type(aty) && !jl_is_array_type(aty)) {
if (!addressOf) {
45 src/cgutils.cpp
@@ -664,6 +664,11 @@ static Value *mark_julia_type(Value *v, jl_value_t *jt)
return v;
}
+static Value *tbaa_decorate(MDNode* md, Instruction* load_or_store) {
+ load_or_store->setMetadata( llvm::LLVMContext::MD_tbaa, md );
+ return load_or_store;
+}
+
// --- generating various error checks ---
static jl_value_t *llvm_type_to_julia(Type *t, bool err=true);
@@ -733,7 +738,7 @@ static void raise_exception_unless(Value *cond, Value *exc, jl_codectx_t *ctx)
static void raise_exception_unless(Value *cond, GlobalVariable *exc,
jl_codectx_t *ctx)
{
- raise_exception_unless(cond, (Value*)builder.CreateLoad(exc, false), ctx);
+ raise_exception_unless(cond, (Value*)tbaa_decorate(tbaa_const,builder.CreateLoad(exc, false)), ctx);
}
static void raise_exception_if(Value *cond, Value *exc, jl_codectx_t *ctx)
@@ -839,18 +844,18 @@ static Value *emit_nthptr_addr(Value *v, Value *idx)
return builder.CreateGEP(builder.CreateBitCast(v, jl_ppvalue_llvmt), idx);
}
-static Value *emit_nthptr(Value *v, size_t n)
+static Value *emit_nthptr(Value *v, size_t n, MDNode *tbaa)
{
// p = (jl_value_t**)v; p[n]
Value *vptr = emit_nthptr_addr(v, n);
- return builder.CreateLoad(vptr, false);
+ return tbaa_decorate(tbaa,builder.CreateLoad(vptr, false));
}
-static Value *emit_nthptr(Value *v, Value *idx)
+static Value *emit_nthptr(Value *v, Value *idx, MDNode *tbaa)
{
// p = (jl_value_t**)v; p[n]
Value *vptr = emit_nthptr_addr(v, idx);
- return builder.CreateLoad(vptr, false);
+ return tbaa_decorate(tbaa,builder.CreateLoad(vptr, false));
}
static Value *typed_load(Value *ptr, Value *idx_0based, jl_value_t *jltype,
@@ -865,7 +870,7 @@ static Value *typed_load(Value *ptr, Value *idx_0based, jl_value_t *jltype,
data = builder.CreateBitCast(ptr, PointerType::get(elty, 0));
else
data = ptr;
- Value *elt = builder.CreateLoad(builder.CreateGEP(data, idx_0based), false);
+ Value *elt = tbaa_decorate(tbaa_user, builder.CreateLoad(builder.CreateGEP(data, idx_0based), false));
if (elty == jl_pvalue_llvmt) {
null_pointer_check(elt, ctx);
}
@@ -891,7 +896,7 @@ static Value *typed_store(Value *ptr, Value *idx_0based, Value *rhs,
data = builder.CreateBitCast(ptr, PointerType::get(elty, 0));
else
data = ptr;
- return builder.CreateStore(rhs, builder.CreateGEP(data, idx_0based));
+ return tbaa_decorate(tbaa_user, builder.CreateStore(rhs, builder.CreateGEP(data, idx_0based)));
}
// --- convert boolean value to julia ---
@@ -988,7 +993,7 @@ static Value *emit_tuplelen(Value *t,jl_value_t *jt)
return builder.CreateLShr(builder.CreatePtrToInt(lenbits, T_int64),
ConstantInt::get(T_int32, 52));
#else
- Value *lenbits = emit_nthptr(t, 1);
+ Value *lenbits = emit_nthptr(t, 1, tbaa_tuplelen);
return builder.CreatePtrToInt(lenbits, T_size);
#endif
}
@@ -1120,20 +1125,20 @@ static Value *emit_tupleref(Value *tuple, Value *ival, jl_value_t *jt, jl_codect
Intrinsic::stacksave));
builder.Insert(stacksave);
Value *tempSpace = builder.CreateAlloca(at);
- builder.CreateStore(tuple,tempSpace);
+ tbaa_decorate(tbaa_user, builder.CreateStore(tuple,tempSpace));
Value *idxs[2];
idxs[0] = ConstantInt::get(T_size,0);
idxs[1] = builder.CreateSub(ival,ConstantInt::get(T_size,1));
Value *v = builder.CreateGEP(tempSpace,ArrayRef<Value*>(&idxs[0],2));
if (idx) {
- v = mark_julia_type(builder.CreateLoad(v), jl_tupleref(jt,ci));
+ v = mark_julia_type(tbaa_decorate(tbaa_user, builder.CreateLoad(v)), jl_tupleref(jt,ci));
}
else {
jl_add_linfo_root(ctx->linfo, jt);
Value *lty = emit_tupleref(literal_pointer_val(jt), ival, jl_typeof(jt), ctx);
size_t i, l = jl_tuple_len(jt);
if (is_tupletype_homogeneous((jl_tuple_t*)jt) && jl_isbits(jl_t0(jt))) {
- v = mark_julia_type(builder.CreateLoad(v), jl_t0(jt));
+ v = mark_julia_type(tbaa_decorate(tbaa_user, builder.CreateLoad(v)), jl_t0(jt));
}
else {
for (i = 0; i < l; i++) {
@@ -1147,7 +1152,7 @@ static Value *emit_tupleref(Value *tuple, Value *ival, jl_value_t *jt, jl_codect
Value *nb = ConstantExpr::getSizeOf(at->getElementType());
if (sizeof(size_t)==4)
nb = builder.CreateTrunc(nb, T_int32);
- v = allocate_box_dynamic(lty, nb, builder.CreateLoad(v));
+ v = allocate_box_dynamic(lty, nb, tbaa_decorate(tbaa_user, builder.CreateLoad(v)));
}
}
}
@@ -1227,7 +1232,7 @@ static Value *emit_arraysize(Value *t, Value *dim)
#endif
Value *dbits =
emit_nthptr(t, builder.CreateAdd(dim,
- ConstantInt::get(dim->getType(), o)));
+ ConstantInt::get(dim->getType(), o)), tbaa_arraysize);
return builder.CreatePtrToInt(dbits, T_size);
}
@@ -1254,7 +1259,7 @@ static Value *emit_arraylen_prim(Value *t, jl_value_t *ty)
{
#ifdef STORE_ARRAY_LEN
(void)ty;
- Value *lenbits = emit_nthptr(t, 2);
+ Value *lenbits = emit_nthptr(t, 2, tbaa_arraylen);
return builder.CreatePtrToInt(lenbits, T_size);
#else
jl_value_t *p1 = jl_tparam1(ty);
@@ -1286,7 +1291,7 @@ static Value *emit_arraylen(Value *t, jl_value_t *ex, jl_codectx_t *ctx)
static Value *emit_arrayptr(Value *t)
{
- return emit_nthptr(t, 1);
+ return emit_nthptr(t, 1, tbaa_arrayptr);
}
static Value *emit_arrayptr(Value *t, jl_value_t *ex, jl_codectx_t *ctx)
@@ -1307,9 +1312,9 @@ static Value *emit_arraysize(Value *t, jl_value_t *ex, int dim, jl_codectx_t *ct
static void assign_arrayvar(jl_arrayvar_t &av, Value *ar)
{
- builder.CreateStore(builder.CreateBitCast(emit_arrayptr(ar),
- av.dataptr->getType()->getContainedType(0)),
- av.dataptr);
+ tbaa_decorate(tbaa_arrayptr,builder.CreateStore(builder.CreateBitCast(emit_arrayptr(ar),
+ av.dataptr->getType()->getContainedType(0)),
+ av.dataptr));
builder.CreateStore(emit_arraylen_prim(ar, av.ty), av.len);
for(size_t i=0; i < av.sizes.size(); i++)
builder.CreateStore(emit_arraysize(ar,i+1), av.sizes[i]);
@@ -1363,7 +1368,7 @@ static Value *emit_array_nd_index(Value *a, jl_value_t *ex, size_t nd, jl_value_
ctx->f->getBasicBlockList().push_back(failBB);
builder.SetInsertPoint(failBB);
- builder.CreateCall2(prepare_call(jlthrow_line_func), builder.CreateLoad(prepare_global(jlboundserr_var)),
+ builder.CreateCall2(prepare_call(jlthrow_line_func), tbaa_decorate(tbaa_const,builder.CreateLoad(prepare_global(jlboundserr_var))),
ConstantInt::get(T_int32, ctx->lineno));
builder.CreateUnreachable();
@@ -1585,7 +1590,7 @@ static void emit_cpointercheck(Value *x, const std::string &msg,
emit_typecheck(t, (jl_value_t*)jl_datatype_type, msg, ctx);
Value *istype =
- builder.CreateICmpEQ(emit_nthptr(t, offsetof(jl_datatype_t,name)/sizeof(char*)),
+ builder.CreateICmpEQ(emit_nthptr(t, offsetof(jl_datatype_t,name)/sizeof(char*), tbaa_datatype),
literal_pointer_val((jl_value_t*)jl_pointer_type->name));
BasicBlock *failBB = BasicBlock::Create(getGlobalContext(),"fail",ctx->f);
BasicBlock *passBB = BasicBlock::Create(getGlobalContext(),"pass");
87 src/codegen.cpp
@@ -59,6 +59,7 @@
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Attributes.h"
#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/MDBuilder.h"
#define LLVM33 1
#else
#include "llvm/DerivedTypes.h"
@@ -94,6 +95,9 @@
#include "llvm/Support/FormattedStream.h"
#include "llvm/Support/DynamicLibrary.h"
#include "llvm/Config/llvm-config.h"
+#ifdef DEBUG
+#include "llvm/Support/CommandLine.h"
+#endif
#include "llvm/Transforms/Utils/Cloning.h"
#include <setjmp.h>
@@ -141,6 +145,7 @@ static RTDyldMemoryManager *jl_mcjmm;
#else
static Module *jl_Module;
#endif
+static MDBuilder *mbuilder;
static std::map<int, std::string> argNumberStrings;
static FunctionPassManager *FPM;
@@ -183,6 +188,23 @@ static Type *T_float64;
static Type *T_pfloat64;
static Type *T_void;
+// type-based alias analysis nodes. Indentation of comments indicates hierarchy.
+static MDNode* tbaa_user; // User data
+static MDNode* tbaa_value; // Julia value
+static MDNode* tbaa_array; // Julia array
+static MDNode* tbaa_arrayptr; // The pointer inside a jl_array_t
+static MDNode* tbaa_arraysize; // A size in a jl_array_t
+static MDNode* tbaa_arraylen; // The len in a jl_array_t
+static MDNode* tbaa_tuplelen; // The len in a jl_tuple_t
+static MDNode* tbaa_func; // A jl_function_t
+static MDNode* tbaa_datatype; // A jl_datatype_t
+static MDNode* tbaa_const; // Memory that is immutable by the time LLVM can see it
+
+namespace llvm {
+ extern Pass *createLowerSimdLoopPass();
+ extern bool annotateSimdLoop( BasicBlock* latch );
+}
+
// constants
static Value *V_null;
@@ -1146,9 +1168,9 @@ static Value *emit_lambda_closure(jl_value_t *expr, jl_codectx_t *ctx)
if (vari.closureidx != -1) {
int idx = vari.closureidx;
#ifdef OVERLAP_TUPLE_LEN
- val = emit_nthptr((Value*)ctx->envArg, idx+1);
+ val = emit_nthptr((Value*)ctx->envArg, idx+1, tbaa_tuplelen);
#else
- val = emit_nthptr((Value*)ctx->envArg, idx+2);
+ val = emit_nthptr((Value*)ctx->envArg, idx+2, tbaa_tuplelen);
#endif
}
else {
@@ -1517,7 +1539,7 @@ static Value *emit_known_call(jl_value_t *ff, jl_value_t **args, size_t nargs,
else if (f->fptr == &jl_f_apply && nargs==2 && ctx->vaStack &&
symbol_eq(args[2], ctx->vaName)) {
Value *theF = emit_expr(args[1],ctx);
- Value *theFptr = builder.CreateBitCast(emit_nthptr(theF,1),jl_fptr_llvmt);
+ Value *theFptr = builder.CreateBitCast(emit_nthptr(theF,1, tbaa_func),jl_fptr_llvmt);
Value *nva = emit_n_varargs(ctx);
#ifdef _P64
nva = builder.CreateTrunc(nva, T_int32);
@@ -1541,8 +1563,8 @@ static Value *emit_known_call(jl_value_t *ff, jl_value_t **args, size_t nargs,
idx = emit_bounds_check(idx, valen, ctx);
idx = builder.CreateAdd(idx, ConstantInt::get(T_size, ctx->nReqArgs));
JL_GC_POP();
- return builder.
- CreateLoad(builder.CreateGEP(ctx->argArray,idx),false);
+ return tbaa_decorate(tbaa_user, builder.
+ CreateLoad(builder.CreateGEP(ctx->argArray,idx),false));
}
Value *arg1 = emit_expr(args[1], ctx);
if (jl_is_long(args[2])) {
@@ -1557,7 +1579,7 @@ static Value *emit_known_call(jl_value_t *ff, jl_value_t **args, size_t nargs,
}
if (idx==0 || (!isseqt && idx > tlen)) {
builder.CreateCall2(prepare_call(jlthrow_line_func),
- builder.CreateLoad(prepare_global(jlboundserr_var)),
+ tbaa_decorate(tbaa_const, builder.CreateLoad(prepare_global(jlboundserr_var))),
ConstantInt::get(T_int32, ctx->lineno));
JL_GC_POP();
return V_null;
@@ -1839,11 +1861,11 @@ static Value *emit_known_call(jl_value_t *ff, jl_value_t **args, size_t nargs,
if (is_structtype_all_pointers(stt)) {
idx = emit_bounds_check(idx, ConstantInt::get(T_size, nfields), ctx);
Value *fld =
- builder.
+ tbaa_decorate(tbaa_user, builder.
CreateLoad(builder.
CreateGEP(builder.
CreateBitCast(strct, jl_ppvalue_llvmt),
- builder.CreateAdd(idx,ConstantInt::get(T_size,1))));
+ builder.CreateAdd(idx,ConstantInt::get(T_size,1)))));
null_pointer_check(fld, ctx);
JL_GC_POP();
return fld;
@@ -2002,7 +2024,7 @@ static Value *emit_call(jl_value_t **args, size_t arglen, jl_codectx_t *ctx,
}
// extract pieces of the function object
// TODO: try extractvalue instead
- theFptr = builder.CreateBitCast(emit_nthptr(theFunc, 1), jl_fptr_llvmt);
+ theFptr = builder.CreateBitCast(emit_nthptr(theFunc, 1, tbaa_func), jl_fptr_llvmt);
theF = theFunc;
}
else {
@@ -2111,9 +2133,9 @@ static Value *var_binding_pointer(jl_sym_t *s, jl_binding_t **pbnd,
assert(((Value*)ctx->envArg)->getType() == jl_pvalue_llvmt);
if (isBoxed(s, ctx)) {
#ifdef OVERLAP_TUPLE_LEN
- return emit_nthptr_addr(emit_nthptr((Value*)ctx->envArg, idx+1), 1);
+ return emit_nthptr_addr(emit_nthptr((Value*)ctx->envArg, idx+1, tbaa_tuplelen), 1);
#else
- return emit_nthptr_addr(emit_nthptr((Value*)ctx->envArg, idx+2), 1);
+ return emit_nthptr_addr(emit_nthptr((Value*)ctx->envArg, idx+2, tbaa_tuplelen), 1);
#endif
}
#ifdef OVERLAP_TUPLE_LEN
@@ -2465,7 +2487,7 @@ static Value *emit_expr(jl_value_t *expr, jl_codectx_t *ctx, bool isboxed,
Value *bp;
if (iskw) {
// fenv = theF->env
- Value *fenv = emit_nthptr(theF, 2);
+ Value *fenv = emit_nthptr(theF, 2, tbaa_func);
// bp = &((jl_methtable_t*)fenv)->kwsorter
bp = emit_nthptr_addr(fenv, 7);
}
@@ -2678,6 +2700,12 @@ static Value *emit_expr(jl_value_t *expr, jl_codectx_t *ctx, bool isboxed,
}
return builder.CreateCall(prepare_call(jlcopyast_func), emit_expr(arg, ctx));
}
+ else if (head == simdloop_sym) {
+ if( !llvm::annotateSimdLoop(builder.GetInsertBlock()) )
+ JL_PRINTF(JL_STDERR,
+ "Warning: could not attach metadata for @simd loop.\n");
+ return NULL;
+ }
else {
if (!strcmp(head->name, "$"))
jl_error("syntax: prefix $ in non-quoted expression");
@@ -3304,7 +3332,7 @@ static Function *emit_function(jl_lambda_info_t *lam, bool cstyle)
// fetch env out of function object if we need it
if (hasCapt) {
- ctx.envArg = emit_nthptr(fArg, 2);
+ ctx.envArg = emit_nthptr(fArg, 2, tbaa_func);
}
// step 8. set up GC frame
@@ -3603,6 +3631,13 @@ static Function *emit_function(jl_lambda_info_t *lam, bool cstyle)
// --- initialization ---
+static MDNode* tbaa_make_child( const char* name, MDNode* parent, bool isConstant=false )
+{
+ MDNode* n = mbuilder->createTBAANode(name,parent,isConstant);
+ n->setValueName( ValueName::Create(name, name+strlen(name)));
+ return n;
+}
+
static GlobalVariable *global_to_llvm(const std::string &cname, void *addr, Module *m)
{
GlobalVariable *gv =
@@ -3676,6 +3711,18 @@ extern "C" DLLEXPORT jl_value_t *jl_new_box(jl_value_t *v)
static void init_julia_llvm_env(Module *m)
{
+ MDNode* tbaa_root = mbuilder->createTBAARoot("jtbaa");
+ tbaa_user = tbaa_make_child("jtbaa_user",tbaa_root);
+ tbaa_value = tbaa_make_child("jtbaa_value",tbaa_root);
+ tbaa_array = tbaa_make_child("jtbaa_array",tbaa_value);
+ tbaa_arrayptr = tbaa_make_child("jtbaa_arrayptr",tbaa_array);
+ tbaa_arraysize = tbaa_make_child("jtbaa_arraysize",tbaa_array);
+ tbaa_arraylen = tbaa_make_child("jtbaa_arraylen",tbaa_array);
+ tbaa_tuplelen = tbaa_make_child("jtbaa_tuplelen",tbaa_value);
+ tbaa_func = tbaa_make_child("jtbaa_func",tbaa_value);
+ tbaa_datatype = tbaa_make_child("jtbaa_datatype",tbaa_value);
+ tbaa_const = tbaa_make_child("jtbaa_const",tbaa_root,true);
+
// every variable or function mapped in this function must be
// exported from libjulia, to support static compilation
T_int1 = Type::getInt1Ty(getGlobalContext());
@@ -4039,6 +4086,10 @@ static void init_julia_llvm_env(Module *m)
#endif
FPM->add(jl_data_layout);
+#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR >= 3
+ jl_TargetMachine->addAnalysisPasses(*FPM);
+#endif
+ FPM->add(createTypeBasedAliasAnalysisPass());
// list of passes from vmkit
FPM->add(createCFGSimplificationPass()); // Clean up disgusting code
FPM->add(createPromoteMemoryToRegisterPass());// Kill useless allocas
@@ -4061,14 +4112,20 @@ static void init_julia_llvm_env(Module *m)
FPM->add(createLoopIdiomPass()); //// ****
FPM->add(createLoopRotatePass()); // Rotate loops.
+ // LoopRotate strips metadata from terminator, so run LowerSIMD afterwards
+ FPM->add(createLowerSimdLoopPass()); // Annotate loop marked with "simdloop" as LLVM parallel loop
FPM->add(createLICMPass()); // Hoist loop invariants
FPM->add(createLoopUnswitchPass()); // Unswitch loops.
+ // Subsequent passes not stripping metadata from terminator
FPM->add(createInstructionCombiningPass());
FPM->add(createIndVarSimplifyPass()); // Canonicalize indvars
//FPM->add(createLoopDeletionPass()); // Delete dead loops
FPM->add(createLoopUnrollPass()); // Unroll small loops
//FPM->add(createLoopStrengthReducePass()); // (jwb added)
+#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR >= 3
+ FPM->add(createLoopVectorizePass()); // Vectorize loops
+#endif
FPM->add(createInstructionCombiningPass()); // Clean up after the unroller
FPM->add(createGVNPass()); // Remove redundancies
//FPM->add(createMemCpyOptPass()); // Remove memcpy / form memset
@@ -4090,6 +4147,9 @@ static void init_julia_llvm_env(Module *m)
extern "C" void jl_init_codegen(void)
{
+#ifdef DEBUG
+ cl::ParseEnvironmentOptions("Julia", "JULIA_LLVM_ARGS");
+#endif
imaging_mode = jl_compileropts.build_path != NULL;
InitializeNativeTarget();
@@ -4174,6 +4234,7 @@ extern "C" void jl_init_codegen(void)
jl_ExecutionEngine = eb.create(jl_TargetMachine);
#endif // LLVM VERSION
jl_ExecutionEngine->DisableLazyCompilation();
+ mbuilder = new MDBuilder(getGlobalContext());
init_julia_llvm_env(m);
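
Note: the practical payoff of the jtbaa hierarchy above is that loads of array metadata (the data pointer, sizes, and length) can be hoisted out of loops, since stores to user-visible data carry the jtbaa_user tag and therefore provably do not alias them. A minimal Julia-level sketch of a loop that benefits (copy_example is a hypothetical name, not part of this patch):

function copy_example(y, x)
    # Without TBAA, the store to y[i] could clobber x's length or data
    # pointer, forcing a reload of both on every iteration; with the
    # jtbaa hierarchy, LLVM can hoist those loads out of the loop.
    for i = 1:length(x)
        @inbounds y[i] = x[i]
    end
    y
end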
3  src/interpreter.c
@@ -444,6 +444,9 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
else if (ex->head == boundscheck_sym) {
return (jl_value_t*)jl_nothing;
}
+ else if (ex->head == simdloop_sym) {
+ return (jl_value_t*)jl_nothing;
+ }
jl_errorf("unsupported or misplaced expression %s", ex->head->name);
return (jl_value_t*)jl_nothing;
}
1  src/jltypes.c
@@ -3134,4 +3134,5 @@ void jl_init_types(void)
boundscheck_sym = jl_symbol("boundscheck");
newvar_sym = jl_symbol("newvar");
copyast_sym = jl_symbol("copyast");
+ simdloop_sym = jl_symbol("simdloop");
}
1  src/julia.h
@@ -427,6 +427,7 @@ extern jl_sym_t *abstracttype_sym; extern jl_sym_t *bitstype_sym;
extern jl_sym_t *compositetype_sym; extern jl_sym_t *type_goto_sym;
extern jl_sym_t *global_sym; extern jl_sym_t *tuple_sym;
extern jl_sym_t *boundscheck_sym; extern jl_sym_t *copyast_sym;
+extern jl_sym_t *simdloop_sym;
// object accessors -----------------------------------------------------------
174 src/llvm-simdloop.cpp
@@ -0,0 +1,174 @@
+#define DEBUG_TYPE "lower_simd_loop"
+#undef DEBUG
+
+// This file defines two entry points:
+// global function annotateSimdLoop: mark a loop as a SIMD loop.
+// createLowerSimdLoopPass: construct an LLVM pass that later lowers marked loops.
+
+#include "llvm/Analysis/LoopPass.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/LLVMContext.h"
+#include "llvm/IR/Metadata.h"
+#include "llvm/Support/Debug.h"
+#include <cstdio>
+
+namespace llvm {
+
+// simd loop
+static unsigned simd_loop_mdkind = 0;
+static MDNode* simd_loop_md = NULL;
+
+/// Mark loop as a SIMD loop. Return false if loop cannot be marked.
+/// incr should be the basic block that increments the loop counter.
+bool annotateSimdLoop(BasicBlock* incr) {
+ DEBUG(dbgs() << "LSL: annotating simd_loop\n");
+ // Lazy initialization
+ if (!simd_loop_mdkind) {
+ simd_loop_mdkind = getGlobalContext().getMDKindID("simd_loop");
+ simd_loop_md = MDNode::get(getGlobalContext(), ArrayRef<Value*>());
+ }
+ // Ideally, the decoration would go on the block itself, but LLVM 3.3 does not
+ // support putting metadata on blocks. So instead, put the decoration on the last
+ // Add instruction, which (somewhat riskily) is assumed to be the loop increment.
+ for (BasicBlock::reverse_iterator ri = incr->rbegin(); ri!=incr->rend(); ++ri) {
+ Instruction& i = *ri;
+ unsigned op = i.getOpcode();
+ if (op==Instruction::Add) {
+ if (i.getType()->isIntegerTy()) {
+ DEBUG(dbgs() << "LSL: setting simd_loop metadata\n");
+ i.setMetadata(simd_loop_mdkind, simd_loop_md);
+ return true;
+ } else {
+ return false;
+ }
+ }
+ }
+ return false;
+}
+
+/// Pass that lowers a loop marked by annotateSimdLoop.
+/// This pass should run after reduction variables have been converted to phi nodes,
+/// otherwise floating-point reductions might not be recognized as such,
+/// which would prevent SIMDization.
+struct LowerSIMDLoop: public LoopPass {
+ static char ID;
+ LowerSIMDLoop() : LoopPass(ID) {}
+
+private:
+ /*override*/ bool runOnLoop(Loop *, LPPassManager &LPM);
+
+ /// Check if loop has "simd_loop" annotation.
+ /// If present, the annotation is an MDNode attached to an instruction in the loop's latch.
+ bool hasSIMDLoopMetadata( Loop *L) const;
+
+ /// If Phi is part of a reduction cycle of FAdd or FMul, mark the ops as permitting reassociation/commuting.
+ void enableUnsafeAlgebraIfReduction(PHINode* Phi, Loop* L) const;
+};
+
+bool LowerSIMDLoop::hasSIMDLoopMetadata(Loop *L) const {
+ // Note: If a loop has 0 or multiple latch blocks, it's probably not a simd_loop anyway.
+ if (BasicBlock* latch = L->getLoopLatch())
+ for (BasicBlock::iterator II = latch->begin(), EE = latch->end(); II!=EE; ++II)
+ if (II->getMetadata(simd_loop_mdkind))
+ return true;
+ return false;
+}
+
+void LowerSIMDLoop::enableUnsafeAlgebraIfReduction(PHINode* Phi, Loop* L) const {
+ typedef SmallVector<Instruction*, 8> chainVector;
+ chainVector chain;
+ Instruction *J;
+ unsigned opcode = 0;
+ for (Instruction *I = Phi; ; I=J) {
+ J = NULL;
+ // Find the user of instruction I that is within loop L.
+ for (Value::use_iterator UI = I->use_begin(), UE = I->use_end(); UI != UE; ++UI) {
+ Instruction *U = cast<Instruction>(*UI);
+ if (L->contains(U)) {
+ if (J) {
+ DEBUG(dbgs() << "LSL: not a reduction var because op has two internal uses: " << *I << "\n");
+ return;
+ }
+ J = U;
+ }
+ }
+ if (!J) {
+ DEBUG(dbgs() << "LSL: chain prematurely terminated at " << *I << "\n");
+ return;
+ }
+ if (J==Phi) {
+ // Found the entire chain.
+ break;
+ }
+ if (opcode) {
+ // Check that arithmetic op matches prior arithmetic ops in the chain.
+ if (J->getOpcode()!=opcode) {
+ DEBUG(dbgs() << "LSL: chain broke at " << *J << " because of wrong opcode\n");
+ return;
+ }
+ } else {
+ // First arithmetic op in the chain.
+ opcode = J->getOpcode();
+ if (opcode!=Instruction::FAdd && opcode!=Instruction::FMul) {
+ DEBUG(dbgs() << "LSL: first arithmetic op in chain is uninteresting" << *J << "\n");
+ return;
+ }
+ }
+ chain.push_back(J);
+ }
+ for (chainVector::const_iterator K=chain.begin(); K!=chain.end(); ++K) {
+ DEBUG(dbgs() << "LSL: marking " << **K << "\n");
+ (*K)->setHasUnsafeAlgebra(true);
+ }
+}
+
+bool LowerSIMDLoop::runOnLoop(Loop *L, LPPassManager &LPM) {
+ if (!simd_loop_mdkind)
+ return false; // Fast rejection test.
+
+ if (!hasSIMDLoopMetadata(L))
+ return false;
+
+ DEBUG(dbgs() << "LSL: simd_loop found\n");
+#if defined(LLVM_VERSION_MAJOR) && LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR >= 4
+ MDNode* n = L->getLoopID();
+ if( !n ) {
+ // Loop does not have a LoopID yet, so give it one.
+ n = MDNode::get(getGlobalContext(), ArrayRef<Value*>(NULL));
+ n->replaceOperandWith(0,n);
+ L->setLoopID(n);
+ }
+#else
+ MDNode* n = MDNode::get(getGlobalContext(), ArrayRef<Value*>());
+ L->getLoopLatch()->getTerminator()->setMetadata("llvm.loop.parallel", n);
+#endif
+ MDNode* m = MDNode::get(getGlobalContext(), ArrayRef<Value*>(n));
+
+ // Mark memory references so that Loop::isAnnotatedParallel will return true for this loop.
+ for(Loop::block_iterator BBI = L->block_begin(), E=L->block_end(); BBI!=E; ++BBI)
+ for (BasicBlock::iterator I = (*BBI)->begin(), EE = (*BBI)->end(); I!=EE; ++I)
+ if (I->mayReadOrWriteMemory())
+ I->setMetadata("llvm.mem.parallel_loop_access", m);
+ assert(L->isAnnotatedParallel());
+
+ // Mark floating-point reductions as okay to reassociate/commute.
+ BasicBlock* Lh = L->getHeader();
+ DEBUG(dbgs() << "LSL: loop header: " << *Lh << "\n");
+ for (BasicBlock::iterator I = Lh->begin(), E = Lh->end(); I!=E; ++I)
+ if (PHINode *Phi = dyn_cast<PHINode>(I))
+ enableUnsafeAlgebraIfReduction(Phi,L);
+
+ return true;
+}
+
+char LowerSIMDLoop::ID = 0;
+
+static RegisterPass<LowerSIMDLoop> X("LowerSIMDLoop", "LowerSIMDLoop Pass",
+ false /* Only looks at CFG */,
+ false /* Analysis Pass */);
+
+Pass* createLowerSimdLoopPass() {
+ return new LowerSIMDLoop();
+}
+
+} // namespace llvm
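
Note: the reduction handling above is what makes a plain @simd sum legal to vectorize: the accumulator becomes a phi node, and the pass walks its FAdd chain and flags each op with unsafe algebra so the vectorizer may reassociate the additions. A minimal sketch (sum_example is a hypothetical name, not part of this patch):

function sum_example(x)
    s = zero(eltype(x))
    # s forms a loop-carried phi -> fadd -> phi cycle; vectorizing it
    # reorders the additions, which is legal only after LowerSIMDLoop
    # marks the fadd as having unsafe algebra.
    @simd for i = 1:length(x)
        @inbounds s += x[i]
    end
    s
end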
2  test/Makefile
@@ -2,7 +2,7 @@ JULIAHOME = $(abspath ..)
include ../Make.inc
TESTS = all core keywordargs numbers strings unicode collections hashing \
- remote iobuffer arrayops linalg blas fft dsp sparse bitarray random \
+ remote iobuffer arrayops simdloop linalg blas fft dsp sparse bitarray random \
math functional bigint sorting statistics spawn parallel arpack file \
git pkg resolve suitesparse complex version pollfd mpfr broadcast \
socket floatapprox priorityqueue readdlm regex float16 combinatorics \
2  test/runtests.jl
@@ -1,7 +1,7 @@
# linalg tests take the longest - start them off first
testnames = [
"linalg1", "linalg2", "core", "keywordargs", "numbers", "strings",
- "collections", "hashing", "remote", "iobuffer", "arrayops",
+ "collections", "hashing", "remote", "iobuffer", "arrayops", "simdloop",
"blas", "fft", "dsp", "sparse", "bitarray", "random", "math",
"functional", "bigint", "sorting", "statistics", "spawn",
"priorityqueue", "arpack", "file", "suitesparse", "version",
44 test/simdloop.jl
@@ -0,0 +1,44 @@
+function simd_loop_example_from_manual(x, y, z)
+ s = zero(eltype(z))
+ n = min(length(x),length(y),length(z))
+ @simd for i in 1:n
+ @inbounds begin
+ z[i] = x[i]-y[i]
+ s += z[i]*z[i]
+ end
+ end
+ s
+end
+
+function simd_loop_with_multiple_reductions(x, y, z)
+ # Use non-zero initial value to make sure reduction values include it.
+ (s,t) = (one(eltype(x)),one(eltype(y)))
+ @simd for i in 1:length(z)
+ @inbounds begin
+ s += x[i]
+ t += 2*y[i]
+ s += z[i] # Two reductions go into s
+ end
+ end
+ (s,t)
+end
+
+for T in {Int32,Int64,Float32,Float64}
+ # Try various lengths to make sure "remainder loop" works
+ for n in {0,1,2,3,4,255,256,257}
+ # Dataset chosen so that results will be exact with only 24 bits of mantissa
+ a = convert(Array{T},[2*j+1 for j in 1:n])
+ b = convert(Array{T},[3*j+2 for j in 1:n])
+ c = convert(Array{T},[5*j+3 for j in 1:n])
+ s = simd_loop_example_from_manual(a,b,c)
+
+ @test a==[2*j+1 for j in 1:n]
+ @test b==[3*j+2 for j in 1:n]
+ @test c==[-j-1 for j in 1:n]
+ @test s==sum(c.*c)
+ (s,t) = simd_loop_with_multiple_reductions(a,b,c)
+ @test s==sum(a)+sum(c)+1
+ @test t==2*sum(b)+1
+ end
+end
+
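One way to verify that these loops actually vectorize is to inspect the generated IR with code_llvm; vector types such as <4 x float> in the loop body indicate success (exact output varies with LLVM version and target). For example, with the test function above:

# Prints the LLVM IR generated for the Float32 specialization.
code_llvm(simd_loop_example_from_manual,
          (Vector{Float32}, Vector{Float32}, Vector{Float32}))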