Tuples made me sad (so I fixed them) #4042

Keno · 2013-08-13T06:37:50Z

Alright, this rabbit hole turned out to be a giant dungeon instead, but I think I have finally found the exit.

Basically what this does is prevent heap allocation of tuples whenever possible, instead passing them in registers or on the stack (or wherever llvm thinks is most efficient). An easy example is this:

julia> bar() = (1,2)

julia> code_llvm(bar,())

; Function Attrs: sspreq
define <2 x i64> @julia_bar714() #2 {
top:
  ret <2 x i64> <i64 1, i64 2>, !dbg !3834
}

whereas before it used to heap allocate it (albeit it was smart enough to recognize that the result was constant, it only did it once, but still: (from the old version)

julia> bar() = (1,2)

julia> code_llvm(bar,())

define %jl_value_t* @julia_bar() {
top:
  ret %jl_value_t* inttoptr (i64 62429664 to %jl_value_t*), !dbg !3407
}

A perhaps more exciting and real work application is size on various array, e.g.

julia> code_llvm(size,(Array{Int64,2},))

; Function Attrs: sspreq
define <2 x i64> @julia_size(%jl_value_t*) #2 {
top:
  %1 = getelementptr inbounds %jl_value_t* %0, i64 4, i32 0, !dbg !3876
  %2 = load %jl_value_t** %1, align 8, !dbg !3876
  %3 = ptrtoint %jl_value_t* %2 to i64, !dbg !3876
  %4 = insertelement <2 x i64> undef, i64 %3, i32 0, !dbg !3881, !julia_type !3882
  %5 = getelementptr inbounds %jl_value_t* %0, i64 5, i32 0, !dbg !3881
  %6 = load %jl_value_t** %5, align 8, !dbg !3881
  %7 = ptrtoint %jl_value_t* %6 to i64, !dbg !3881
  %8 = insertelement <2 x i64> %4, i64 %7, i32 1, !dbg !3881, !julia_type !3882
  ret <2 x i64> %8, !dbg !3881
}

Note that there is now no allocation involved, whereas before (again from the old version):

julia> code_llvm(size,(Array{Int64,2},))

define %jl_value_t* @julia_size(%jl_value_t*, %jl_value_t**, i32) {
top:
  %3 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*]* %3, i64 0, i64 0
  %4 = getelementptr [3 x %jl_value_t*]* %3, i64 0, i64 2, !dbg !3419
  store %jl_value_t* inttoptr (i64 2 to %jl_value_t*), %jl_value_t** %.sub, align 8
  %5 = getelementptr [3 x %jl_value_t*]* %3, i64 0, i64 1, !dbg !3419
  %6 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !3419
  %.c = bitcast %jl_value_t** %6 to %jl_value_t*, !dbg !3419
  store %jl_value_t* %.c, %jl_value_t** %5, align 8, !dbg !3419
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !3419
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %7 = load %jl_value_t** %1, align 8, !dbg !3419
  %8 = getelementptr inbounds %jl_value_t* %7, i64 4, i32 0, !dbg !3419
  %9 = load %jl_value_t** %8, align 8, !dbg !3419
  %10 = ptrtoint %jl_value_t* %9 to i64, !dbg !3419
  %11 = call %jl_value_t* @jl_box_int64(i64 %10), !dbg !3424
  store %jl_value_t* %11, %jl_value_t** %4, align 8, !dbg !3424
  %12 = call %jl_value_t* @allocobj(i64 32), !dbg !3424
  %13 = getelementptr inbounds %jl_value_t* %12, i64 2, i32 0, !dbg !3424
  store %jl_value_t* %11, %jl_value_t** %13, align 8, !dbg !3424
  %14 = getelementptr inbounds %jl_value_t* %12, i64 0, i32 0, !dbg !3424
  store %jl_value_t* inttoptr (i64 28847136 to %jl_value_t*), %jl_value_t** %14, align 8, !dbg !3424
  %15 = getelementptr inbounds %jl_value_t* %12, i64 1, i32 0, !dbg !3424
  store %jl_value_t* inttoptr (i64 2 to %jl_value_t*), %jl_value_t** %15, align 8, !dbg !3424
  %16 = getelementptr inbounds %jl_value_t* %12, i64 3, i32 0, !dbg !3424
  store %jl_value_t* null, %jl_value_t** %16, align 8, !dbg !3424
  store %jl_value_t* %12, %jl_value_t** %4, align 8, !dbg !3424
  %17 = getelementptr inbounds %jl_value_t* %7, i64 5, i32 0, !dbg !3424
  %18 = load %jl_value_t** %17, align 8, !dbg !3424
  %19 = ptrtoint %jl_value_t* %18 to i64, !dbg !3424
  %20 = call %jl_value_t* @jl_box_int64(i64 %19), !dbg !3424
  store %jl_value_t* %20, %jl_value_t** %16, align 8, !dbg !3424
  %21 = load %jl_value_t** %5, align 8, !dbg !3424
  %22 = getelementptr inbounds %jl_value_t* %21, i64 0, i32 0, !dbg !3424
  store %jl_value_t** %22, %jl_value_t*** @jl_pgcstack, align 8, !dbg !3424
  ret %jl_value_t* %12, !dbg !3424
}

Another exciting part of this pull request that if your types happen to correspond to a machine vector type (aka SIMD type), I make it just that, allowing fun things like

julia> foo(x::(Int32,Int32,Int32,Int32)) = llvmcall("""%3 = add <4 x i32> %1, %0
       ret <4 x i32> %3""",(Int32,Int32,Int32,Int32),((Int32,Int32,Int32,Int32),(Int32,Int32,Int32,Int32)),(int32(1),int32(2),int32(3),int32(4)),
       x)
# methods for generic function foo
foo(x::(Int32,Int32,Int32,Int32)) at none:1

julia> code_llvm(foo,((Int32,Int32,Int32,Int32),))
; Number of arguments: 2
define <4 x i32> @julia_foo1(<4 x i32> ,<4 x i32> ) {
%3 = add <4 x i32> %1, %0
ret <4 x i32> %3
}
define linkonce_odr <4 x i32> @julia_foo1(<4 x i32>, <4 x i32>) {
  %3 = add <4 x i32> %1, %0
  ret <4 x i32> %3
}


; Function Attrs: sspreq
define <4 x i32> @julia_foo(<4 x i32>) #2 {
top:
  %1 = add <4 x i32> %0, <i32 1, i32 2, i32 3, i32 4>
  ret <4 x i32> %1, !dbg !3798
}

julia> code_native(foo,((Int32,Int32,Int32,Int32),))
    .text
Filename: none
Source line: 1
    push    RBP
    mov RBP, RSP
    sub RSP, 16
    mov RAX, QWORD PTR FS:[40]
    mov QWORD PTR [RBP - 8], RAX
    mov RAX, QWORD PTR FS:[40]
    cmp RAX, QWORD PTR [RBP - 8]
    jne 20
    movabs  RAX, 140737297706192
    paddd   XMM0, XMMWORD PTR [RAX]
Source line: 1
    add RSP, 16
    pop RBP
    ret
    movabs  RAX, 140737322375370
    call    RAX

Note in particular the paddd in the native disassembly. The llvmcall used here is a new intrinsic to easily embed any LLVM IR into your julia functions. It is part of this pull request for demonstration purposes mostly, but is fully featured and ready to be used, but if people would rather not have it, I can rebase it out.

The tuple unboxing currently is only enabled for immutable types due to GC considerations, but I will follow up with that patch after discussion on this one is done (it already work fine for other types if you disable the GC).

Fixes #2496
Fixes #2299
/cc @JeffBezanson

simonster · 2013-08-13T06:53:06Z

src/codegen.cpp

@@ -112,6 +115,13 @@ void __attribute__(()) __stack_chk_fail()
 static DIBuilder *dbuilder;
 static std::map<int, std::string> argNumberStrings;
 static FunctionPassManager *FPM;
+static PassManager *PM;


Am I overlooking something, or is this unused?

No, you're right. I used to have an inlining pass in there, but that didn't actually work.

timholy · 2013-08-13T09:42:28Z

This is huge! Really amazing work.

StefanKarpinski · 2013-08-13T15:51:27Z

Very cool stuff. Have you observed any major speedups due to this change?

JeffBezanson · 2013-08-13T15:53:41Z

You should benchmark complex sqrt and log, since they use ssqs which returns a tuple.

ViralBShah · 2013-08-13T18:03:13Z

Really cool!

Keno · 2013-08-13T23:02:40Z

Here's a fun one:

julia> foo(x::(Nothing,Nothing)) = x
# methods for generic function foo
foo(x::(Nothing,Nothing)) at none:1

julia> foo((nothing,nothing))
(nothing,nothing)

julia> code_llvm(foo,((Nothing,Nothing),))

define void @julia_foo631() {
top:
  ret void, !dbg !3461
}

ViralBShah · 2013-08-14T06:06:16Z

Marking this as 0.3 milestone.

Keno · 2013-08-16T20:03:18Z

Took longer than I had hoped, but I feel this is done now. Complex sqrt for example got about 2x faster.

lindahua · 2013-08-19T03:52:33Z

Excellent!

I think the llvmcall intrinsic is eventually unlocking the capability of using SIMD instructions? This would be a huge step.

ViralBShah · 2013-08-19T05:07:53Z

With the recent cleanup of floating point internals as part float16, it should be much easier to get simd capabilities. I hope we will have it in 0.3.

Keno · 2013-08-19T05:11:57Z

I should have been more clear. This also adds simd types. In particular the (Int32,Int32,Int32,Int32) in the example above is a SIMD vector. Note that the llvmcall expands to a packed add.

ViralBShah · 2013-08-19T05:42:49Z

That is really cool.

GunnarFarneback · 2013-08-19T07:29:43Z

I still need to figure out how to do SIMD loading and storing of data and I don't know what it takes to obtain more exotic SIMD instructions like palignr, but llvmcall can definitely be used to generate some SIMD instructions. Here's another teaser, computing x*y+z and producing very clean code on a Sandy Bridge processor with 64-bit Linux:

typealias Float32x4 (Float32,Float32,Float32,Float32)
function bar(x::Float32x4, y::Float32x4, z::Float32x4)
    xy = Base.llvmcall("""%3 = fmul <4 x float> %1, %0                                                                                                                                                  
ret <4 x float> %3""", Float32x4, (Float32x4, Float32x4), x, y)
    Base.llvmcall("""%3 = fadd <4 x float> %1, %0                                                                                                                                                       
ret <4 x float> %3""", Float32x4, (Float32x4, Float32x4), xy, z)
end

julia> code_native(bar, (Float32x4, Float32x4, Float32x4))
        .text
Filename: none
Source line: 4
        push    RBP
        mov     RBP, RSP
        vmulps  XMM0, XMM0, XMM1
        vaddps  XMM0, XMM0, XMM2
Source line: 4
        pop     RBP
        ret

256 bit SIMD with AVX also works:

typealias Float32x8 (Float32, Float32, Float32, Float32, Float32, Float32, Float32, Float32)
function bar(x::Float32x8, y::Float32x8, z::Float32x8)
    xy = Base.llvmcall("""%3 = fmul <8 x float> %1, %0                                                                                                                                                  
ret <8 x float> %3""", Float32x8, (Float32x8, Float32x8), x, y)
    Base.llvmcall("""%3 = fadd <8 x float> %1, %0                                                                                                                                                       
ret <8 x float> %3""", Float32x8, (Float32x8, Float32x8), xy, z)
end

julia> code_native(bar, (Float32x8, Float32x8, Float32x8))
        .text
Filename: none
Source line: 4
        push    RBP
        mov     RBP, RSP
        vmulps  YMM0, YMM0, YMM1
        vaddps  YMM0, YMM0, YMM2
Source line: 4
        pop     RBP
        ret

Keno · 2013-08-19T07:33:08Z

I should really make NTuple{Float32,8} work as a notation.

JeffBezanson · 2013-08-19T07:46:32Z

NTuple{8,Float32} already works; it is treated as the same type.

Keno · 2013-08-19T07:48:08Z

It didn't work when I tested it, but I will have another look.

lindahua · 2013-08-19T14:15:28Z

Just tested, NTuple{N,T} works, and NTuple{4, Int} is exactly just (Int, Int, Int, Int).

Looking forward to using llvmcall, then we can revive the discussion of SIMD design!

Keno · 2013-08-19T14:23:37Z

Must've been doing sth. wrong then. Thanks for testing.

Keno · 2013-10-05T01:34:06Z

Rebased, for those following along at home ;).

ArchRobison · 2013-11-07T19:21:20Z

Mapping tuples to the native vector datatypes is nifty. I want to try it out, but the fork is only working partially for me. I'm on a 64-bit Ubuntu 12.0 Linux box with a Haswell processor, and running into a problem. The symptom is that julia-readline exits after I start it.

$ /localdisk/adrobiso/simd/julia/usr/bin/julia-readline
$

If I put the the examples in an input file, it works fine. For example:

$ cat /tmp/a.jl
code_llvm(size,(Array{Int64,2},))
$ usr/bin/julia-readline /tmp/a.jl

define <2 x i64> @julia_size(%jl_value_t*) {
top:
  %1 = getelementptr inbounds %jl_value_t* %0, i64 4, i32 0, !dbg !2775
  %2 = load %jl_value_t** %1, align 8, !dbg !2775
  %3 = ptrtoint %jl_value_t* %2 to i64, !dbg !2775
  %4 = insertelement <2 x i64> undef, i64 %3, i32 0, !dbg !2780, !julia_type !2781
  %5 = getelementptr inbounds %jl_value_t* %0, i64 5, i32 0, !dbg !2780
  %6 = load %jl_value_t** %5, align 8, !dbg !2780
  %7 = ptrtoint %jl_value_t* %6 to i64, !dbg !2780
  %8 = insertelement <2 x i64> %4, i64 %7, i32 1, !dbg !2780, !julia_type !2781
  ret <2 x i64> %8, !dbg !2780
}

Newbie question: What's the recommended way to debug failure of the REPL? I started using gdb on usr/bin/julia-debug-basic, but was getting lost when I hit the jl_apply calls.

staticfloat · 2013-11-07T19:28:47Z

Try running it inside of strace:

strace ./julia

That can sometimes print out interesting information.

ViralBShah · 2013-11-09T15:32:25Z

I believe this is going to be merged soon, as we will release 0.2 this weekend and branch.

Keno · 2013-11-15T21:33:04Z

@ArchRobison I hadn't noticed since I usually use my own REPL. That turns out to be an LLVM bug (sigh): http://llvm.org/bugs/show_bug.cgi?id=12618. I'll see if I can work around it.

timholy · 2013-11-15T21:41:54Z

Or we could just merge your REPL before we merge this patch? Hopefully we're talking a matter of days here...

Keno · 2013-11-17T00:55:31Z

Now, that 0.2 is tagged, can we finally merge this?

ihnorton · 2013-11-17T00:58:53Z

+1

StefanKarpinski · 2013-11-17T01:04:29Z

Yes. This is a non-breaking change, right? So does that mean it can be merged into the release-0.2 branch?

Conflicts: src/cgutils.cpp

Keno · 2013-11-30T07:35:03Z

@JeffBezanson, anything else? I'd love to merge this soon so I can put up some of the followup stuff.

Keno · 2013-12-05T22:01:50Z

@JeffBezanson bump. Let's get this merged soon.

Tuples made me sad (so I fixed them)

johnmyleswhite · 2013-12-07T03:28:28Z

This sped up the library I'm working on to 66% of its previous runtime. Well done!

timholy · 2013-12-07T03:33:09Z

oh boy oh boy oh boy oh boy

julia> include("testgrid.jl")  # tests interpolation in Grid.jl
elapsed time: 0.647127396 seconds (175658452 bytes allocated)

# Now git pull && make, restart julia

julia> include("testgrid.jl")
elapsed time: 0.339640071 seconds (48006676 bytes allocated)

We are finally on-par with Matlab's griddedInterpolant.

vtjnash · 2013-12-07T06:49:42Z

src/cgutils.cpp

+            if(iity->getBitWidth() > 32)
+                i = builder.CreateTrunc(i,T_int32);
+            else if(iity->getBitWidth() < 32)
+                i = builder.CreateZExt(i,T_int32);


theres also a CreateZExtOrTrunc function, although it just implements this same logic

Ah, yes. Fair point. I was trying to remember what it was called, but then again as you noted it doesn't really matter.

IainNZ · 2013-12-07T06:55:15Z

Prelim speed center results:

ATLAS build of
- bench_eu_devec
- bench_eu_vev
- mandel
- raytracer
- stockcorr
- became as fast as OpenBLAS times (why?! - also some ATLAS builds maybe got slower too...)
gk, go, parse_int, randn_zig got faster

ViralBShah · 2013-12-07T07:23:48Z

No change for sparse matvec. I was not expecting it, but tried it just in case.

ViralBShah · 2013-12-07T07:27:32Z

nbody_vec seems to have slowed down a lot. @staticfloat nbody_vec does not seem to be running on criid.

IainNZ · 2013-12-07T07:30:18Z

I should note I was looking at "criid"

ViralBShah · 2013-12-07T07:34:31Z

http://speed.julialang.org/timeline/?exe=1&base=none&ben=nbody_vec&env=2&revs=50&equid=off

staticfloat · 2013-12-07T10:21:33Z

@ViralBShah Already been reported; in fact, none of the shootout/ benchmarks are getting run, because we're dying somewhere in the process.

* upstream/master: (53 commits) edit embedding chapter formatting fixes Fix JuliaLang#5056 more consitent/useful description of predicate in ie all() NEWS for JuliaLang#4042 Add sparse matvec to perf benchmark (#4707) Use warn_once instead of warn (JuliaLang#5038) Use warn() instead of println in base/sprase/sparsematrix.jl allow scale(b,A) or scale(A,b) when b is a scalar as well as a vector, don't restrict scale unnecessarily to arrays of numbers (e.g. scaling arrays of arrays should work), and improve documentation for scale\! Added section about memory management added negative bitarray rolling More accurate linspace for types with greater precision than Float64 add realmin realmax for BigFloat fix eps(realmax), add typemin and typemax for BigFloat fix JuliaLang#5025, ordering used by nextfloat/prevfloat roll back on errors during type definitions. fixes JuliaLang#5009 improve method signature errors. fixes JuliaLang#5018 update juliadoc add more asserts for casts Fix segfaulting on strstr() failure ...

simonster reviewed Aug 13, 2013
View reviewed changes

ghost assigned JeffBezanson Sep 18, 2013

Keno mentioned this pull request Nov 1, 2013

SIMD types #2299

Closed

Keno added 8 commits November 29, 2013 02:43

Fix build failure after merge

71a5def

Cleanup

665ffe5

Separate out llvmcall

e02db98

Simplify emit_tupleset, but disallowing non-constant ones

9deb3a1

Add a missing GC root

ee26d21

Fix an improperly adjusted gc frame

f9999b1

Conflicts: src/cgutils.cpp

Fix vararg root counting

a7d99e6

Fix a rebase conflict

d5ad8e6

JeffBezanson added a commit that referenced this pull request Dec 7, 2013

Merge pull request #4042 from JuliaLang/kf/tuples

3db073d

Tuples made me sad (so I fixed them)

JeffBezanson merged commit 3db073d into master Dec 7, 2013

JeffBezanson mentioned this pull request Dec 7, 2013

Game of GO benchmark #1169

Closed

vtjnash reviewed Dec 7, 2013
View reviewed changes

Keno mentioned this pull request Dec 7, 2013

Add llvmcall #5046

Merged

timholy mentioned this pull request Dec 7, 2013

Possible memory leak? #5047

Closed

stevengj added a commit that referenced this pull request Dec 7, 2013

NEWS for #4042

c795690

jiahao mentioned this pull request Mar 16, 2014

ERROR: no method start(Nothing) in the presence of @show #6175

Closed

simonster mentioned this pull request Jul 11, 2014

Remove outdated size tweak from performance tips section of manual #7562

Merged

Keno deleted the kf/tuples branch August 10, 2014 18:47

GunnarFarneback mentioned this pull request Oct 26, 2014

A new RNG for Julia? #8786

Closed

Tuples made me sad (so I fixed them) #4042

Tuples made me sad (so I fixed them) #4042

Conversation

Keno commented Aug 13, 2013

simonster Aug 13, 2013

Choose a reason for hiding this comment

Keno Aug 13, 2013

Choose a reason for hiding this comment

timholy commented Aug 13, 2013

StefanKarpinski commented Aug 13, 2013

JeffBezanson commented Aug 13, 2013

ViralBShah commented Aug 13, 2013

Keno commented Aug 13, 2013

ViralBShah commented Aug 14, 2013

Keno commented Aug 16, 2013

lindahua commented Aug 19, 2013

ViralBShah commented Aug 19, 2013

Keno commented Aug 19, 2013

ViralBShah commented Aug 19, 2013

GunnarFarneback commented Aug 19, 2013

Keno commented Aug 19, 2013

JeffBezanson commented Aug 19, 2013

Keno commented Aug 19, 2013

lindahua commented Aug 19, 2013

Keno commented Aug 19, 2013

Keno commented Oct 5, 2013

ArchRobison commented Nov 7, 2013

staticfloat commented Nov 7, 2013

ViralBShah commented Nov 9, 2013

Keno commented Nov 15, 2013

timholy commented Nov 15, 2013

Keno commented Nov 17, 2013

ihnorton commented Nov 17, 2013

StefanKarpinski commented Nov 17, 2013

Keno commented Nov 30, 2013

Keno commented Dec 5, 2013

johnmyleswhite commented Dec 7, 2013

timholy commented Dec 7, 2013

vtjnash Dec 7, 2013

Choose a reason for hiding this comment

Keno Dec 7, 2013

Choose a reason for hiding this comment

IainNZ commented Dec 7, 2013

ViralBShah commented Dec 7, 2013

ViralBShah commented Dec 7, 2013

IainNZ commented Dec 7, 2013

ViralBShah commented Dec 7, 2013

staticfloat commented Dec 7, 2013