SLP vectorization not working for tuples #11899

jrevels · 2015-06-27T21:48:31Z

After reading #6271, I wanted to reproduce the vectorization that @ArchRobison displayed in his tuple example. Unfortunately, this doesn't seem to work any longer (just a copy-paste of the example given in the aforementioned PR):

$ julia -O
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5513 (2015-06-22 15:58 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit faae194* (5 days old master)
|__/                   |  x86_64-apple-darwin13.4.0

julia> function add( a::NTuple{4,Float32}, b::NTuple{4,Float32} )
           (a[1]+b[1],a[2]+b[2],a[3]+b[3],a[4]+b[4])
       end
add (generic function with 1 method)

julia> function mul( a::NTuple{4,Float32}, b::NTuple{4,Float32} )
           (a[1]*b[1],a[2]*b[2],a[3]*b[3],a[4]*b[4])
       end
mul (generic function with 1 method)

julia> function madd( a::NTuple{4,Float32}, b::NTuple{4,Float32}, c::NTuple{4,Float32} )
           add(mul(a,b),c)
       end
madd (generic function with 1 method)

julia> t = NTuple{4,Float32}
Tuple{Float32,Float32,Float32,Float32}

julia> code_llvm(madd,(t,t,t))

define [4 x float] @julia_madd_21498([4 x float]*, [4 x float]*, [4 x float]*) {
top:
  %3 = getelementptr inbounds [4 x float]* %0, i64 0, i64 0
  %4 = load float* %3, align 4
  %5 = getelementptr inbounds [4 x float]* %1, i64 0, i64 0
  %6 = load float* %5, align 4
  %7 = fmul float %4, %6
  %8 = getelementptr inbounds [4 x float]* %0, i64 0, i64 1
  %9 = load float* %8, align 4
  %10 = getelementptr inbounds [4 x float]* %1, i64 0, i64 1
  %11 = load float* %10, align 4
  %12 = fmul float %9, %11
  %13 = getelementptr inbounds [4 x float]* %0, i64 0, i64 2
  %14 = load float* %13, align 4
  %15 = getelementptr inbounds [4 x float]* %1, i64 0, i64 2
  %16 = load float* %15, align 4
  %17 = fmul float %14, %16
  %18 = getelementptr inbounds [4 x float]* %0, i64 0, i64 3
  %19 = load float* %18, align 4
  %20 = getelementptr inbounds [4 x float]* %1, i64 0, i64 3
  %21 = load float* %20, align 4
  %22 = fmul float %19, %21
  %23 = getelementptr inbounds [4 x float]* %2, i64 0, i64 0
  %24 = load float* %23, align 4
  %25 = fadd float %7, %24
  %26 = insertvalue [4 x float] undef, float %25, 0
  %27 = getelementptr inbounds [4 x float]* %2, i64 0, i64 1
  %28 = load float* %27, align 4
  %29 = fadd float %12, %28
  %30 = insertvalue [4 x float] %26, float %29, 1
  %31 = getelementptr inbounds [4 x float]* %2, i64 0, i64 2
  %32 = load float* %31, align 4
  %33 = fadd float %17, %32
  %34 = insertvalue [4 x float] %30, float %33, 2
  %35 = getelementptr inbounds [4 x float]* %2, i64 0, i64 3
  %36 = load float* %35, align 4
  %37 = fadd float %22, %36
  %38 = insertvalue [4 x float] %34, float %37, 3
  ret [4 x float] %38
}

Caveat: I built v0.4 using LLVM 3.6, which wasn't tested in #6271.

ScottPJones · 2015-06-27T22:16:51Z

3.6.0 or 3.6.1? (not sure it would make any difference)

jrevels · 2015-06-27T22:24:52Z

@ScottPJones 3.6.1

I'm currently building with override LLVM_VER=svn to see what pops out (though I doubt moving forward in the build will actually fix the issue).

ScottPJones · 2015-06-27T22:26:54Z

OK, that's 3.7, not 3.6 (I've also been building with both 3.6.1 (release) and 3.7 (svn))

ScottPJones · 2015-06-27T22:28:31Z

I've been having problems with LLVN-svn, today, and even the release 0.3.10 😞

jrevels · 2015-06-27T22:28:31Z

Sorry - when I said "currently", I meant "at this very moment I'm rebuilding with 3.7". The version I was running on a couple of minutes ago when I filed this issue was 3.6.1

ScottPJones · 2015-06-27T23:41:16Z

Ah, OK, my misunderstanding! Brain tired after having 4 days of drinking from the proverbial fire hose!

ArchRobison · 2015-06-28T02:24:44Z

I can replicate the problem. I'll look into it next week.

ArchRobison · 2015-06-29T22:10:51Z

I see the problem. Julia 0.3 represents homogeneous tuples as LLVM vectors; Julia 0.4 represents them as LLVM arrays. Doing so torpedoes SLP vectorizer since LLVM has no vector instructions for arrays.

What's the motivation for the change? Short homogeneous tuples seem like a nice abstraction for hardware vectors. (My apologies for missing whatever conversation led to the change.)

StefanKarpinski · 2015-06-29T22:22:59Z

cc @vtjnash, @Keno, @JeffBezanson

ArchRobison · 2015-06-29T22:50:42Z

Looks like #11187 was the motivation. In the commit I see:

    this additionally prepares for turning back on allocating tuples as vectors,
    since the gc now guarantees 16-byte alignment

And the code comments say:

+                    bool validVectorSize = (ntypes == 2 || ntypes == 4 || ntypes == 6);
+                    if (0 && lasttype->isSingleValueType() && !lasttype->isVectorTy() && validVectorSize) // currently disabled due to load/store alignment issues
+                        jst->struct_decl = VectorType::get(lasttype, ntypes);

Is there any way I can help with the load/store alignment issues? We should also consider refining validVectorSize so that we get the LLVM vector representation for 16-wide float vectors if the target machine has AVX-512, and likewise for 8-wide float vectors on AVX2 machines. The test should perhaps consider the bit width of the tuple instead of the number of elements.

carnaval · 2015-06-29T23:20:26Z

We probably want to wait for Jameson's codegen rewrite to do that, so that we can make choices on storage on a per-case basis instead of having to decide only based on the type itself.

The vector alignment was part of this I think, since if we declare that all homogeneous tuples of size < K have to be llvm vector types, then they also have to be that when inside a struct, which would prevent us from having C compatible layout because of alignment.

carnaval · 2015-06-29T23:22:46Z

cc @vtjnash to check. But I think with your change we could have tuples be llvm vectors only when we got them out of an array or are passing them as specsig function arguments (and homogeneous, and of a reasonable size, given the buggy handling of weird sized vectors by llvm)

vtjnash · 2015-06-30T04:29:36Z

Looks like #11187 was the motivation. In the commit I see:

it was turned off prior to then. i did a bit of work in #11187 trying to make it possible to turn back on, but didn't manage to get it fully working.

Is there any way I can help with the load/store alignment issues? We should also consider refining validVectorSize so that we get the LLVM vector representation for 16-wide float vectors if the target machine has AVX-512, and likewise for 8-wide float vectors on AVX2 machines. The test should perhaps consider the bit width of the tuple instead of the number of elements.

if you try turning it on, see which tests fail / segfault due to incorrect alignment assumptions

We probably want to wait for Jameson's codegen rewrite to do that, so that we can make choices on storage on a per-case basis instead of having to decide only based on the type itself.

yes, please! it's almost done.

The vector alignment was part of this I think, since if we declare that all homogeneous tuples of size < K have to be llvm vector types, then they also have to be that when inside a struct, which would prevent us from having C compatible layout because of alignment.

cc @vtjnash to check. But I think with your change we could have tuples be llvm vectors only when we got them out of an array or are passing them as specsig function arguments (and homogeneous, and of a reasonable size, given the buggy handling of weird sized vectors by llvm)

i would love to hear ideas for handling this sanely. currently the behavior of tuples in llvmcall / ccall is officially undefined (even though we have a test for it), since it isn't entirely clear when tuples/structs should turn into vector types, and when they should be array types. maybe we should special-case a MMX type in codegen, like we have for floats, signed, and complex?

additionally, since the new AVX stuff requires > 16 byte alignment, it will be a bit complicated to allocate them since both system malloc and Julia's GC only guarantee 16 byte alignment. as a workaround, it may be best to just force LLVM to use the unaligned vmov instructions.

ArchRobison · 2015-06-30T15:08:11Z

I'm looking forward to the codegen rewrite.

I concur with using unaligned vector move instructions for the >16 byte alignment cases unless the compiler can prove the vector is aligned.

mlubin · 2015-07-05T00:51:58Z

So this is blocked by #11973? (And hence won't be fixed before 0.4)

ArchRobison · 2015-07-06T17:32:46Z

It would seem so, unless we want to hack an ad-hoc patch for the current codegen, which is probably not worth the effort.

ArchRobison · 2015-08-04T19:58:13Z

I've worked out an ABI and patch. All the tests pass. I'll wait for #11973 to be committed so I can restructure it for that.

The ABI that I came up with uses an LLVM vector for a tuple under the following conditions:

The tuple is homogeneous
The corresponding element type is legal as an element of an LLVM vector.
The tuple has length 2, 4, or 8.
The tuple is not a sub-object of a larger object.
The tuple is not being passed to C or llvmcall.
Otherwise it is represented as an array.

Rule 4 ensures layouts have the same offsets as before. Rules 4 and 5 sometimes require bitwise conversion between arrays and tuples, though LLVM seems good about removing gratuitous ones.

The net effect of these rules is that LLVM vector types are used when vector-like tuples act as return values, arguments to specialized signatures, local variables, or SSA values.

ArchRobison · 2015-10-02T17:00:58Z

Good news/bad news:

I got this working again with the latest master branch, and it makes the SLP vectorization work for tuples again.
When SLP vectorization does not kick in, it slows down obvious examples.
An example is this toy benchmark. Using a i4770 ("Haswell") and LLVM 3.7, I get:
- With the changes and -O, the x86 code is vectorized nicely.
- With the changes and without -O, the x86 code is ugly. The LLVM code is relatively clean, but the x86 backend is making a mess out of it.
- Without the changes and without -O, the x86 code is scalar, but clean. Though scalar, superscalar execution is quite effective at overlapping the operations.

Changes are in my repository under branch adr/simdtuple.

There seem to be three ways to take this forward:

Fix SLP Vectorizer to deal with structs, so homogeneous tuples won't need to be treated as a special case at the Julia level.
Fix LLVM's x86 backend.
Give up on SLP Vectorizer and use LLVM call to get at the vector instructions.

I'm leaning in the direction of 1 since it's the most general solution. E.g., it would enable SIMDization of an immutable type with "red", "green", "blue", and "alpha" fields of type float32. Though as an Intel employee, option 2 is easy to justify for reasons of pride and my time. :-)

I'll mull this some more.

simonster · 2015-10-02T17:05:33Z

When we switch to LLVM 3.7 (#9336) we can just enable -O all the time, right? In #6271 it looked like the performance difference with LLVM 3.5 was negligible, so unless this regressed between then and LLVM 3.7 we should be okay.

ArchRobison · 2015-10-02T18:22:02Z

The problem will be codes that use homogeneous tuples in way that is not amenable to SIMD instructions. E.g., elementwise addition of the first three elements, and a subtraction for the fourth. My change as it currently is would slow down those kinds of codes.

ScottPJones · 2015-10-02T19:25:01Z

But, 1 & 2 are not mutually exclusive, are they? Both sound like they should be done at some point.

ArchRobison · 2015-10-30T18:05:04Z

Update: I submitted an LLVM patch yesterday that makes SLPVectorization work with tuples again. Once it passes code review, it shouldn't be difficult to backport as a patch against LLVM 3.7. The patch not only enables SLP vectorization of tuples; it also enables SLP vectorization of immutable types.

This gist shows two examples, one with tuple and one with immutable, that both vectorize with the patch.

ScottPJones · 2015-10-30T19:42:05Z

Nice! What sort of results did you get before/after from the examples in your gist?

jrevels · 2015-10-31T22:19:15Z

The patch not only enables SLP vectorization of tuples; it also enables SLP vectorization of immutable types.

This looks awesome! Looking forward to it.

KristofferC · 2017-01-22T13:09:58Z

This vectorizes when using -O3.

julia> code_llvm(madd,(t,t,t))

define void @julia_madd_65810([4 x float]* noalias sret, [4 x float]*, [4 x float]*, [4 x float]*) #0 !dbg !5 {
top:
  %4 = bitcast [4 x float]* %1 to <4 x float>*
  %5 = load <4 x float>, <4 x float>* %4, align 4
  %6 = bitcast [4 x float]* %2 to <4 x float>*
  %7 = load <4 x float>, <4 x float>* %6, align 4
  %8 = fmul <4 x float> %5, %7
  %9 = bitcast [4 x float]* %3 to <4 x float>*
  %10 = load <4 x float>, <4 x float>* %9, align 4
  %11 = fadd <4 x float> %8, %10
  %12 = bitcast [4 x float]* %0 to <4 x float>*
  store <4 x float> %11, <4 x float>* %12, align 4
  ret void
}

ArchRobison mentioned this issue Jun 29, 2015

It's too easy to pass large tuples by value #11187

Closed

IainNZ added performance Must go faster regression Regression in behavior compared to a previous version labels Jun 30, 2015

jrevels mentioned this issue Nov 6, 2015

CI Performance Tracking for v0.5 #13893

Closed

4 tasks

KristofferC closed this as completed Jan 22, 2017

tkelman added the potential benchmark Could make a good benchmark in BaseBenchmarks label Jan 22, 2017

KristofferC mentioned this issue Aug 17, 2018

add some regression tests from base JuliaCI/BaseBenchmarks.jl#226

Merged

KristofferC removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Oct 31, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SLP vectorization not working for tuples #11899

SLP vectorization not working for tuples #11899

jrevels commented Jun 27, 2015

ScottPJones commented Jun 27, 2015

jrevels commented Jun 27, 2015

ScottPJones commented Jun 27, 2015

ScottPJones commented Jun 27, 2015

jrevels commented Jun 27, 2015

ScottPJones commented Jun 27, 2015

ArchRobison commented Jun 28, 2015

ArchRobison commented Jun 29, 2015

StefanKarpinski commented Jun 29, 2015

ArchRobison commented Jun 29, 2015

carnaval commented Jun 29, 2015

carnaval commented Jun 29, 2015

vtjnash commented Jun 30, 2015

ArchRobison commented Jun 30, 2015

mlubin commented Jul 5, 2015

ArchRobison commented Jul 6, 2015

ArchRobison commented Aug 4, 2015

ArchRobison commented Oct 2, 2015

simonster commented Oct 2, 2015

ArchRobison commented Oct 2, 2015

ScottPJones commented Oct 2, 2015

ArchRobison commented Oct 30, 2015

ScottPJones commented Oct 30, 2015

jrevels commented Oct 31, 2015

KristofferC commented Jan 22, 2017 •

edited

SLP vectorization not working for tuples #11899

SLP vectorization not working for tuples #11899

Comments

jrevels commented Jun 27, 2015

ScottPJones commented Jun 27, 2015

jrevels commented Jun 27, 2015

ScottPJones commented Jun 27, 2015

ScottPJones commented Jun 27, 2015

jrevels commented Jun 27, 2015

ScottPJones commented Jun 27, 2015

ArchRobison commented Jun 28, 2015

ArchRobison commented Jun 29, 2015

StefanKarpinski commented Jun 29, 2015

ArchRobison commented Jun 29, 2015

carnaval commented Jun 29, 2015

carnaval commented Jun 29, 2015

vtjnash commented Jun 30, 2015

ArchRobison commented Jun 30, 2015

mlubin commented Jul 5, 2015

ArchRobison commented Jul 6, 2015

ArchRobison commented Aug 4, 2015

ArchRobison commented Oct 2, 2015

simonster commented Oct 2, 2015

ArchRobison commented Oct 2, 2015

ScottPJones commented Oct 2, 2015

ArchRobison commented Oct 30, 2015

ScottPJones commented Oct 30, 2015

jrevels commented Oct 31, 2015

KristofferC commented Jan 22, 2017 • edited

KristofferC commented Jan 22, 2017 •

edited