# JuliaLang/julia

bitarray: use gen_cart_map in assign with ranges

1 parent f46ec21 commit f5ac9889da893238fa8c3de8a3ad5954dba9b596 carlobaldassi committed Apr 18, 2012
Showing with 82 additions and 14 deletions.
1. +82 −14 extras/bitarray.jl
@@ -324,23 +324,91 @@ function ref{T<:Integer}(B::BitArray{T}, I::Integer...)
return B[index]
end
+# note: we can gain some performance if the first dimension is a range;
+# TODO: extend to I:Indices... (i.e. not necessarily contiguous)
let ref_cache = nothing
-global ref
-function ref(B::BitArray, I::Indices...)
- i = length(I)
- while i > 0 && isa(I[i],Integer); i-=1; end
- d = map(length, I[1:i])::Dims
- X = similar(B, d)
+ global ref
+ function ref(B::BitArray, I0::Range1{Int}, I::Union(Integer, Range1{Int})...)
+ nI = 1 + length(I)
+ # this will be uncommented when
+ # the indexing behaviour is settled
+ #if ndims(B) != nI
+ #error("wrong number of dimensions in ref")
+ #end
+ while nI > 1 && isa(I[nI-1],Integer); nI-=1; end
+ d = tuple(length(I0), map(length, I[1:nI-1])...)::Dims
+ X = similar(B, d)
+
+ I = map(x->(isa(x,Integer) ? (x:x) : x), I[1:nI-1])
+
+ f0 = first(I0)
+ l0 = length(I0)
+ if nI == 1
+ _jl_copy_chunks(X.chunks, 1, B.chunks, f0, l0)
+ return X
+ end
+ if is(ref_cache,nothing)
+ ref_cache = HashTable()
+ end
+ f_lst = [first(r)::Int | r in I]
+ l_lst = [last(r)::Int | r in I]
+ ind_lst = copy(f_lst)
+ gap_lst = l_lst - f_lst + 1
+ stride_lst = Array(Int, nI - 1)
+ stride = 1
+ ind = f0
+ for k = 1 : nI - 1
+ stride *= size(B, k)
+ stride_lst[k] = stride
+ ind += stride * (ind_lst[k] - 1)
+ gap_lst[k] *= stride
+ end
+ reverse!(stride_lst)
+ reverse!(gap_lst)
- if is(ref_cache,nothing)
- ref_cache = HashTable()
+ genbodies = cell(nI, 2)

#### JeffBezanson (Owner) Apr 18, 2012

You want to delay this stuff by putting it inside e.g. `genbodies = function (ivars) ... end`, since otherwise it happens on every call, even if there is a cached version of the needed function.

#### carlobaldassi (Member) Apr 18, 2012

Done, thanks.

+ genbodies[1] = quote
+ _jl_copy_chunks(X.chunks, storeind, B.chunks, ind, l0)
+ storeind += l0
+ ind += stride_lst[loop_ind]
+ end
+ for k = 2 : nI
+ genbodies[k, 1] = quote
+ loop_ind += 1
+ end
+ genbodies[k, 2] = quote
+ ind -= gap_lst[loop_ind]
+ loop_ind -= 1
+ if loop_ind > 0
+ ind += stride_lst[loop_ind]
+ end
+ end
+ end
+ gen_cartesian_map(ref_cache,
+ ivars->genbodies,
+ I, (:B, :X, :storeind, :ind, :l0, :stride_lst, :gap_lst, :loop_ind),
+ B, X, 1, ind, l0, stride_lst, gap_lst, 0)
+ return X
end
- gen_cartesian_map(ref_cache, ivars -> quote
- X[storeind] = B[\$(ivars...)]
- storeind += 1
- end, I, (:B, :X, :storeind), B, X, 1)
- return X
end
+
+let ref_cache = nothing
+ global ref
+ function ref(B::BitArray, I::Indices...)
+ i = length(I)
+ while i > 0 && isa(I[i],Integer); i-=1; end
+ d = map(length, I[1:i])::Dims
+ X = similar(B, d)
+
+ if is(ref_cache,nothing)
+ ref_cache = HashTable()
+ end
+ gen_cartesian_map(ref_cache, ivars -> quote
+ X[storeind] = B[\$(ivars...)]
+ storeind += 1
+ end, I, (:B, :X, :storeind), B, X, 1)
+ return X
+ end
end
# logical indexing
@@ -462,7 +530,7 @@ let assign_cache = nothing
genbodies[k, 2] = quote
ind -= gap_lst[loop_ind]
loop_ind -= 1
- if (loop_ind > 0)
+ if loop_ind > 0
ind += stride_lst[loop_ind]
end
end

### JeffBezanson (Owner) commented on `f5ac988` Apr 18, 2012

You should use the special bit string functions like `count_ones` if you don't already.
### carlobaldassi (Member) replied Apr 18, 2012

I used `count_ones` in `nnz` in place of my own bit-voodoo code and the speedup is dramatic, about 10x! I didn't find any other use for the special functions in `int.jl` for bitarrays, though. Given the example of `count_ones`, a bit-swapping function may be useful in `reverse` (there's only one that reverses bytes), and maybe even one for the 8x8 bit transpose (though it's possible that in that case the overhead of the other operations would reduce the advantage).
### carlobaldassi (Member) replied Apr 18, 2012

Also, the message of this commit is wrong; it should have been "use gen_cart_map in ref with ranges"! Sorry...
### carlobaldassi (Member) replied Apr 18, 2012

Ok, now I see that LLVM doesn't offer any bit string intrinsics other than those already provided by `int.jl`; I also verified that calling a C implementation of the 8x8 bit transpose doesn't give any advantage over the one written in Julia (and both are way slower than compiled C code, say 50x slower). It seems to me there's not much to be done on this side; I don't know if you had something else in mind.
### JeffBezanson (Owner) replied Apr 18, 2012

This seems to say that C is slower than C; can you clarify?
### carlobaldassi (Member) replied Apr 19, 2012

Yes, that's what it seems I said indeed :) In fact, I wasn't careful enough in doing the benchmarks, and these functions are so fast that any detail can screw up the timings and hide the true results (I've basically been measuring some overhead). So I did the benchmarks afresh, and indeed the shared library version of the bit transpose is much faster. Here's the Julia one:

```
julia> h(n, x)=for i=1:n; _jl_transpose8rS64(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.48500514030456543 seconds
```

and the C one:

```
julia> _jl_bitops = dlopen("libbit_ops"); ctr = dlsym(_jl_bitops, :transpose8rS64);
julia> _jl_transpose8x8(x::Uint64) = ccall(ctr, Uint64, (Uint64,), x)
julia> g(n, x)=for i=1:n; _jl_transpose8x8(x); end; x=0x98a714fde8ef967b; n=10^7; @time g(n,x);
elapsed time: 0.06575798988342285 seconds
```

(I ran everything multiple times to be sure I wasn't measuring compilation etc.) For reference, here's the code for the two functions:

```julia
function _jl_transpose8rS64(x::Uint64)
    t = (x $ (x >>> 7)) & 0x00aa00aa00aa00aa
    x = x $ t $ (t << 7)
    t = (x $ (x >>> 14)) & 0x0000cccc0000cccc
    x = x $ t $ (t << 14)
    t = (x $ (x >>> 28)) & 0x00000000f0f0f0f0
    x = x $ t $ (t << 28)
    return x
end
```

```c
extern unsigned long long transpose8rS64(unsigned long long x)
{
    unsigned long long t;
    t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AALL;
    x = x ^ t ^ (t << 7);
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCLL;
    x = x ^ t ^ (t << 14);
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0LL;
    x = x ^ t ^ (t << 28);
    return x;
}
```

I wonder if there's anything which can be done about that. I think putting the bit string operations in a shared library would sound like a defeat; that's not an option!

Addendum: it still seems that there's some `ccall` overhead in the shared library call from Julia, since this C code:

```c
extern unsigned long long transpose8rS64(unsigned long long x)
{
    // the above function here
}

int main(int argc, char **argv)
{
    int i;
    unsigned long long x = 0x98a714fde8ef967b;
    for (i = 0; i < 1e7; ++i)
        x = transpose8rS64(x);
    return 0;
}
```

compiled with -O3 executes in about one third of the time, but I guess that's to be expected. I didn't try linking dynamically from C. I checked everything with n=10^8 and 10^9 and the timings scale up linearly as expected.
### carlobaldassi (Member) replied Apr 19, 2012

Here's something interesting: the previous tests with C code were compiled with gcc -O3. Compiling the shared library with gcc -O0 gives much worse performance than Julia native code (1.25s vs 0.48s). EDIT: the following, up to the end marker, is wrong. But: compiling the shared library with clang -O0 is faster than native Julia (0.13s vs 0.48s). However, compiling the shared library with clang -O3 yields worse performance, comparable with native Julia! (I'm using clang version 2.9. Using -O1 or -O2 gives the same results as -O3.) END EDIT. More Fun Facts: compiling the C executable with clang -O0 is slower than invoking the clang-O0-compiled shared library from within Julia (0.25s; I guess LLVM in Julia is unrolling the for loop or something). Compiling the C executable with clang -O3 gives zero execution time, because clang is smart enough to understand that the code in main is pointless :)
### carlobaldassi (Member) replied Apr 19, 2012

Ok, sorry again: disregard the previous silly post, it was just a typo...
### carlobaldassi (Member) replied Apr 19, 2012

It seems the bottleneck is... the assignment! See the following:

```
julia> v(x)=(2x)
julia> h(n, x)=for i=1:n; v(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.01648998260498047 seconds

julia> v(x)=(x=2x)
julia> h(n, x)=for i=1:n; v(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.4615170955657959 seconds
```

Also, this horrible code is a version of the bit-transpose function without assignments:

```julia
function btr(x)
    ((((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0xCCCC3333CCCC3333)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0x0000CCCC0000CCCC)<<14)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))>>14)&0x0000CCCC0000CCCC))&0xF0F0F0F00F0F0F0F)|
     ((((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0xCCCC3333CCCC3333)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0x0000CCCC0000CCCC)<<14)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))>>14)&0x0000CCCC0000CCCC))&0x00000000F0F0F0F0)<<28)|
     ((((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0xCCCC3333CCCC3333)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))&0x0000CCCC0000CCCC)<<14)|
       ((((x&0xAA55AA55AA55AA55)|((x&0x00AA00AA00AA00AA)<<7)|((x>>7)&0x00AA00AA00AA00AA))>>14)&0x0000CCCC0000CCCC))>>28)&0x00000000F0F0F0F0))
end
```

Yes, it's disgusting, but:

```
julia> h(n, x)=for i=1:n; btr(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.020379066467285156 seconds
```

Slightly less disturbing:

```julia
_t1(x) = (x $ (x >>> 7)) & 0x00aa00aa00aa00aa
_x1(x) = x $ _t1(x) $ (_t1(x) << 7)
_t2(x) = (_x1(x) $ (_x1(x) >>> 14)) & 0x0000cccc0000cccc
_x2(x) = _x1(x) $ _t2(x) $ (_t2(x) << 14)
_t3(x) = (_x2(x) $ (_x2(x) >>> 28)) & 0x00000000f0f0f0f0
btr2(x) = _x2(x) $ _t3(x) $ (_t3(x) << 28)
```

(this is the exact translation of `_jl_transpose8rS64` above, with function calls in place of assignments.) Again:

```
julia> h(n, x)=for i=1:n; btr2(x); end; x=0x98a714fde8ef967b; n=10^7; @time h(n,x);
elapsed time: 0.015455007553100586 seconds
```
### JeffBezanson (Owner) replied Apr 19, 2012

Fascinating! There must be an unnecessary box operation happening on the assignment. It probably has to do with assigning to an argument, which isn't too common.
### carlobaldassi (Member) replied Apr 19, 2012

I thought of that too, but no: it seems any assignment triggers some kind of overhead (it doesn't matter whether it's 1 or 10 assignments; the timings are the same). Also, I tried all kinds of type asserts, to no avail.
### JeffBezanson (Owner) replied Apr 19, 2012

Ah, I think all the versions without assignment are getting fully inlined into the loop, while the others do function calls. Still, the overhead is way too much. I will work on this.
### carlobaldassi (Member) replied Apr 20, 2012

Yes, that's it! I tried inlining manually by pasting code directly in the loop, and indeed I recover the kind of performance I get with the shared library, and in fact even better (see below)! Currently, the best option I could come up with in a more realistic setting is to use the following macro instead of a function:

```julia
macro btr6(x)
    t,y = gensym(2)
    quote
        $y = $x
        $t = ($y $ ($y >>> 7)) & 0x00aa00aa00aa00aa
        $y = $y $ $t $ ($t << 7)
        $t = ($y $ ($y >>> 14)) & 0x0000cccc0000cccc
        $y = $y $ $t $ ($t << 14)
        $t = ($y $ ($y >>> 28)) & 0x00000000f0f0f0f0
        $x = $y $ $t $ ($t << 28)
    end
end
```

With this, I ran the following test function:

```julia
g(n, x)=for i=1:n; @btr6 x[i]; end;
h(n, x)=for i=1:n; x[i]=_jl_transpose8x8(x[i]); end; # ccalls the shared lib function

function trtest()
    n=10^7
    x=[randi(Uint64)|j=1:n]
    println("macro:")
    @time g(n,x)
    println("shared library:")
    @time h(n,x)
end
```

and here's the staggering result:

```
julia> trtest()
macro:
elapsed time: 0.07177591323852539 seconds
shared library:
elapsed time: 0.08093500137329102 seconds
```

So the macro version is even faster than the `ccall`'d one! (This is no artifact: I tested a whole range of conditions, tried the two versions directly in the repl with global variables, scaled `n` and whatnot; the result is stable, and the performance gap is well above statistical fluctuations.) Additional notes:

- `gensym` doesn't affect performance.
- All other versions, including all the assignment-less versions I reported above, have noticeably worse performance in this more realistic scenario where values are actually passed around (of course the manually-inlined code has the same performance as the macro).
- Compiling the shared library with gcc or clang doesn't change much (the above result is for gcc, which seemed to give slightly faster timings).
- Inlining the `ccall` doesn't change anything.
### JeffBezanson (Owner) replied Apr 20, 2012

Very good analysis! LLVM's JIT does generate good code, and calling a C function will certainly have non-zero overhead.