bitarray: use gen_cart_map in assign with ranges

1 parent f46ec21 commit f5ac9889da893238fa8c3de8a3ad5954dba9b596 @carlobaldassi carlobaldassi committed Apr 18, 2012
Showing with 82 additions and 14 deletions.
  1. +82 −14 extras/bitarray.jl
@@ -324,23 +324,91 @@ function ref{T<:Integer}(B::BitArray{T}, I::Integer...)
return B[index]
end
+# note: we can gain some performance if the first dimension is a range;
+# TODO: extend to I:Indices... (i.e. not necessarily contiguous)
let ref_cache = nothing
-global ref
-function ref(B::BitArray, I::Indices...)
- i = length(I)
- while i > 0 && isa(I[i],Integer); i-=1; end
- d = map(length, I[1:i])::Dims
- X = similar(B, d)
+ global ref
+ function ref(B::BitArray, I0::Range1{Int}, I::Union(Integer, Range1{Int})...)
+ nI = 1 + length(I)
+ # this will be uncommented when
+ # the indexing behaviour is settled
+ #if ndims(B) != nI
+ #error("wrong number of dimensions in ref")
+ #end
+ while nI > 1 && isa(I[nI-1],Integer); nI-=1; end
+ d = tuple(length(I0), map(length, I[1:nI-1])...)::Dims
+ X = similar(B, d)
+
+ I = map(x->(isa(x,Integer) ? (x:x) : x), I[1:nI-1])
+
+ f0 = first(I0)
+ l0 = length(I0)
+ if nI == 1
+ _jl_copy_chunks(X.chunks, 1, B.chunks, f0, l0)
+ return X
+ end
+ if is(ref_cache,nothing)
+ ref_cache = HashTable()
+ end
+ f_lst = [first(r)::Int | r in I]
+ l_lst = [last(r)::Int | r in I]
+ ind_lst = copy(f_lst)
+ gap_lst = l_lst - f_lst + 1
+ stride_lst = Array(Int, nI - 1)
+ stride = 1
+ ind = f0
+ for k = 1 : nI - 1
+ stride *= size(B, k)
+ stride_lst[k] = stride
+ ind += stride * (ind_lst[k] - 1)
+ gap_lst[k] *= stride
+ end
+ reverse!(stride_lst)
+ reverse!(gap_lst)
- if is(ref_cache,nothing)
- ref_cache = HashTable()
+ genbodies = cell(nI, 2)
JeffBezanson (Owner) commented Apr 18, 2012:

You want to delay this stuff by putting it inside e.g. genbodies = function (ivars) ... end, since otherwise it happens on every call, even if there is a cached version of the needed function.

carlobaldassi (Member) replied Apr 18, 2012:

done, thanks.
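
A hedged sketch of that restructuring (reusing the names of the surrounding method; the actual follow-up change may differ): the quoted blocks below get built inside an anonymous function, so they are only constructed when gen_cartesian_map misses its cache and has to generate a new specialized function.

genbodies = function (ivars)
    bodies = cell(nI, 2)
    # first (innermost) body: copy a whole contiguous run of bits at once
    bodies[1] = quote
        _jl_copy_chunks(X.chunks, storeind, B.chunks, ind, l0)
        storeind += l0
        ind += stride_lst[loop_ind]
    end
    for k = 2 : nI
        # pre-body: descend one loop level
        bodies[k, 1] = quote
            loop_ind += 1
        end
        # post-body: rewind the index and climb back up
        bodies[k, 2] = quote
            ind -= gap_lst[loop_ind]
            loop_ind -= 1
            if loop_ind > 0
                ind += stride_lst[loop_ind]
            end
        end
    end
    return bodies
end
gen_cartesian_map(ref_cache, genbodies, I,
                  (:B, :X, :storeind, :ind, :l0, :stride_lst, :gap_lst, :loop_ind),
                  B, X, 1, ind, l0, stride_lst, gap_lst, 0)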

+ genbodies[1] = quote
+ _jl_copy_chunks(X.chunks, storeind, B.chunks, ind, l0)
+ storeind += l0
+ ind += stride_lst[loop_ind]
+ end
+ for k = 2 : nI
+ genbodies[k, 1] = quote
+ loop_ind += 1
+ end
+ genbodies[k, 2] = quote
+ ind -= gap_lst[loop_ind]
+ loop_ind -= 1
+ if loop_ind > 0
+ ind += stride_lst[loop_ind]
+ end
+ end
+ end
+ gen_cartesian_map(ref_cache,
+ ivars->genbodies,
+ I, (:B, :X, :storeind, :ind, :l0, :stride_lst, :gap_lst, :loop_ind),
+ B, X, 1, ind, l0, stride_lst, gap_lst, 0)
+ return X
end
- gen_cartesian_map(ref_cache, ivars -> quote
- X[storeind] = B[$(ivars...)]
- storeind += 1
- end, I, (:B, :X, :storeind), B, X, 1)
- return X
end
+
+let ref_cache = nothing
+ global ref
+ function ref(B::BitArray, I::Indices...)
+ i = length(I)
+ while i > 0 && isa(I[i],Integer); i-=1; end
+ d = map(length, I[1:i])::Dims
+ X = similar(B, d)
+
+ if is(ref_cache,nothing)
+ ref_cache = HashTable()
+ end
+ gen_cartesian_map(ref_cache, ivars -> quote
+ X[storeind] = B[$(ivars...)]
+ storeind += 1
+ end, I, (:B, :X, :storeind), B, X, 1)
+ return X
+ end
end
# logical indexing
@@ -462,7 +530,7 @@ let assign_cache = nothing
genbodies[k, 2] = quote
ind -= gap_lst[loop_ind]
loop_ind -= 1
- if (loop_ind > 0)
+ if loop_ind > 0
ind += stride_lst[loop_ind]
end
end
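
A brief usage note on the hunk above (B here stands for any existing multi-dimensional BitArray; the call itself is not part of the commit): when the first index is a contiguous range, dispatch picks the new chunk-copying method, so e.g.

X = B[2:50, 3:7, 2]    # leading Range1{Int}: contiguous runs of bits are copied with _jl_copy_chunks

while indexing with a non-range first index (e.g. a vector of positions) still goes through the generic ref method kept below it.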

14 comments on commit f5ac988

JeffBezanson (Owner) commented on f5ac988 Apr 18, 2012

You should use the special bit string functions like count_ones if you don't already.

carlobaldassi (Member) replied Apr 18, 2012

I used count_ones in nnz in place of my own bit-voodoo code and the speedup is dramatic, about 10x!!
I didn't find any other use for the special functions in int.jl for bitarrays, though. Given the example of count_ones, a bit-reversal function may be useful in reverse (there's currently only one, which reverses bytes), and maybe even one for the 8x8 bit transpose (though it's possible that in that case the overhead of the other operations would reduce the advantage).
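
For reference, a minimal sketch of that count_ones-based counting (hypothetical function name, not necessarily the committed nnz; it assumes the unused bits of the last chunk are kept at zero):

function nnz_sketch(B::BitArray)
    n = 0
    # sum the population count of each 64-bit storage chunk
    for i = 1 : length(B.chunks)
        n += count_ones(B.chunks[i])
    end
    return n
end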

carlobaldassi (Member) replied Apr 18, 2012

Also, the message of this commit is wrong: it should have been "use gen_cart_map in ref with ranges"! Sorry...

carlobaldassi (Member) replied Apr 18, 2012

OK, I now see that LLVM doesn't offer any bit string intrinsics other than those already provided by int.jl; I also verified that calling a C implementation of the 8x8 bit transpose doesn't give any advantage over the one written in Julia (and both are way slower than compiled C code, say 50x slower). It seems to me there's not much to be done on this side; I don't know if you had something else in mind.

JeffBezanson (Owner) replied Apr 18, 2012

This seems to say that C is slower than C, can you clarify?

carlobaldassi (Member) replied Apr 19, 2012

Yes, that's indeed what I seemed to say :) In fact, I wasn't careful enough with the benchmarks, and these functions are so fast that any detail can skew the timings and hide the true results (I was basically measuring some overhead).
So I did the benchmarks afresh and indeed the shared library version of the bit transpose is much faster. Here's the Julia one:

julia> h(n, x)=for i=1:n; _jl_transpose8rS64(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.48500514030456543 seconds

and the C:

julia> _jl_bitops = dlopen("libbit_ops"); ctr = dlsym(_jl_bitops, :transpose8rS64);

julia> _jl_transpose8x8(x::Uint64) = ccall(ctr, Uint64, (Uint64,), x)

julia> g(n, x)=for i=1:n; _jl_transpose8x8(x); end; x=0x98a714fde8ef967b; n=10^7; @time g(n,x);
elapsed time: 0.06575798988342285 seconds

(I ran everything multiple times to be sure I wasn't measuring compilation etc.)

For reference, here's the code for the two functions:

function _jl_transpose8rS64(x::Uint64)
   t = (x $ (x >>> 7)) & 0x00aa00aa00aa00aa;
   x = x $ t $ (t << 7);
   t = (x $ (x >>> 14)) & 0x0000cccc0000cccc
   x = x $ t $ (t << 14);
   t = (x $ (x >>> 28)) & 0x00000000f0f0f0f0
   x = x $ t $ (t << 28);
   return x
end
extern unsigned long long transpose8rS64(unsigned long long x)
{
   unsigned long long t;

   t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AALL;
   x = x ^ t ^ (t << 7);
   t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCLL;
   x = x ^ t ^ (t << 14);
   t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0LL;
   x = x ^ t ^ (t << 28);

   return x;
}

I wonder if there's anything that can be done about that. Putting the bit string operations in a shared library would feel like a defeat, so that's not an option!

Addendum: it still seems that there's some ccall overhead in the shared library call from Julia, since this C code:

extern unsigned long long transpose8rS64(unsigned long long x)
{
        // the above function here
}

int main(int argc, char **argv)
{
    int i;
    unsigned long long x = 0x98a714fde8ef967b;
    for (i = 0; i < 1e7; ++i) x = transpose8rS64(x);
    return 0;
}

compiled with -O3 executes in about one third of the time, but I guess that's to be expected. I didn't try linking dynamically from C. I checked everything with n=10^8 and 10^9 and the timings scale up linearly as expected.

carlobaldassi (Member) replied Apr 19, 2012

Here's something interesting: the previous tests with C code were compiled with gcc -O3. Compiling the shared library with gcc -O0 gives much worse performance than Julia native code (1.25s vs 0.48s). EDIT: the following, up to the end marker, is wrong. But: compiling the shared library with clang -O0 is faster than native Julia (0.13s vs 0.48s). However, compiling the shared library with clang -O3 yields worse performance, comparable to native Julia! (I'm using clang version 2.9. Using -O1 or -O2 gives the same results as -O3.) [end of the wrong part]

More Fun Facts: compiling the C executable with clang -O0 is slower than invoking the clang-O0-compiled shared library from within Julia (0.25s; I guess LLVM in Julia is unrolling the for loop or something). Compiling the C executable with clang -O3 gives zero execution time, because clang is smart enough to understand that the code in main is pointless :)

carlobaldassi (Member) replied Apr 19, 2012

Ok sorry again, disregard the previous silly post, it was just a typo...

carlobaldassi (Member) replied Apr 19, 2012

It seems the bottleneck is...the assignment! See the following:

julia> v(x)=(2x)

julia> h(n, x)=for i=1:n; v(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.01648998260498047 seconds

julia> v(x)=(x=2x)

julia> h(n, x)=for i=1:n; v(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.4615170955657959 seconds

Also, this horrible code is a version of the bittranspose function without assignments:

function btr(x)
   ((((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0xCCCC3333CCCC3333)|
      ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0x0000CCCC0000CCCC)<<14)|
      ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))>>14)&0x0000CCCC0000CCCC))&0xF0F0F0F00F0F0F0F)|
    ((((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0xCCCC3333CCCC3333)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0x0000CCCC0000CCCC)<<14)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))>>14)&0x0000CCCC0000CCCC))&0x00000000F0F0F0F0)<<28)|
    ((((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0xCCCC3333CCCC3333)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0x0000CCCC0000CCCC)<<14)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))>>14)&0x0000CCCC0000CCCC))>>28)&0x00000000F0F0F0F0))
end

yes it's disgusting, but:

julia> h(n, x)=for i=1:n; btr(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.020379066467285156 seconds

slightly less disturbing:

_t1(x) = (x $ (x >>> 7)) & 0x00aa00aa00aa00aa
_x1(x) = x $ _t1(x) $ (_t1(x) << 7)
_t2(x) = (_x1(x) $ (_x1(x) >>> 14)) & 0x0000cccc0000cccc
_x2(x) = _x1(x) $ _t2(x) $ (_t2(x) << 14)
_t3(x) = (_x2(x) $ (_x2(x) >>> 28)) & 0x00000000f0f0f0f0
btr2(x) = _x2(x) $ _t3(x) $ (_t3(x) << 28)

(This is the exact translation of _jl_transpose8rS64 above, with function calls in place of assignments.) Again:

julia> h(n, x)=for i=1:n; btr2(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.015455007553100586 seconds

JeffBezanson (Owner) replied Apr 19, 2012

Fascinating! There must be an unnecessary box operation happening on the assignment. It probably has to do with assigning to an argument, which isn't too common.
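
A quick way to separate the two hypotheses (hypothetical one-liners, not taken from this thread): compare rebinding the argument with assigning to a fresh local,

v_arg(x)   = (x = 2x)       # rebinds the function argument
v_local(x) = (y = 2x; y)    # assigns to a new local variable instead

If only v_arg were slow, the cost would be specific to argument rebinding; as the next reply notes, the slowdown shows up for assignments in general.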

carlobaldassi (Member) replied Apr 19, 2012

I thought of that too, but no, it seems any assignment triggers some kind of overhead (it doesn't matter if it's 1 or 10 assignments, the timings are the same). Also, I tried all kinds of type asserts to no avail.

JeffBezanson (Owner) replied Apr 19, 2012

Ah, I think all the versions without assignment are getting fully inlined into the loop, while the others do function calls. Still, the overhead is way too much. I will work on this.

carlobaldassi (Member) replied Apr 20, 2012

Yes, that's it! I tried inlining manually by pasting code directly in the loop and indeed I recover the kind of performance I get with the shared library, and in fact even better (see below)! Currently, the best option I could come up with in a more realistic setting is to use the following macro instead of a function:

macro btr6(x)
    t,y=gensym(2)
    quote
        $y = $x
        $t = ($y $ ($y >>> 7)) & 0x00aa00aa00aa00aa;
        $y = $y $ $t $ ($t << 7);
        $t = ($y $ ($y >>> 14)) & 0x0000cccc0000cccc
        $y = $y $ $t $ ($t << 14);
        $t = ($y $ ($y >>> 28)) & 0x00000000f0f0f0f0
        $x = $y $ $t $ ($t << 28);
    end
end

with this, I ran the following test function:

g(n, x)=for i=1:n; @btr6 x[i]; end;
h(n, x)=for i=1:n; x[i]=_jl_transpose8x8(x[i]); end; #ccalls the shared lib

function trtest()
    n=10^7;
    x=[randi(Uint64)|j=1:n];
    println("macro:")
    @time g(n,x);
    println("shared library:")
    @time h(n,x);
end

and here's the staggering result:

julia> trtest()
macro:
elapsed time: 0.07177591323852539 seconds
shared library:
elapsed time: 0.08093500137329102 seconds

So the macro version is even faster than the ccall'd one!!
(This is no artifact: I tested a whole range of conditions, tried the two versions directly in the REPL with global variables, scaled n, and whatnot; the result is stable, and the performance gap is well above statistical fluctuations.)

Additional notes:

  1. gensym doesn't affect performance
  2. all other versions, including all the assignment-less versions I reported above, have noticeably worse performance in this more realistic scenario where values are actually passed around (of course the manually-inlined code has the same performance as the macro).
  3. compiling the shared library with gcc or clang doesn't change much (the above result is for gcc, which seemed to give slightly faster timings).
  4. inlining the ccall doesn't change anything

JeffBezanson (Owner) replied Apr 20, 2012

Very good analysis! LLVM's JIT does generate good code, and calling a C function will certainly have non-zero overhead.
