Speed up scalar BitArray indexing by ~25% #11403

mbauman · 2015-05-22T17:15:09Z

This works around issue #9974 for BitArray indexing. BitArrays use inlined helper functions, unsafe_bit(get|set)index, to do the dirty work of picking bits out of the chunks array. Previously, these helpers took the array of chunks as an argument, but that array needs a GC root since BitArrays are mutable. This changes those helper functions to work on whole BitArray itself, which enables an optimization to avoid that root (since the chunks array is only accessed as an argument to arrayref, which is special-cased).

The ~25% performance gain is for unsafe_getindex; the difference isn't quite as big for getindex (only ~10%) since there's still a GC root for the BoundsError. That can also be avoided, but I'd rather make that change more systematically (as checkbounds) with #10525 or a subset thereof.

This works around issue #9974 for BitArray indexing. BitArrays use inlined helper functions, `unsafe_bit(get|set)index`, to do the dirty work of picking bits out of the chunks array. Previously, these helpers took the array of chunks as an argument, but that array needs a GC root since BitArrays are mutable. This changes those helper functions to work on whole BitArray itself, which enables an optimization to avoid that root (since the chunks array is only accessed as an argument to `arrayref`, which is special-cased). The ~25% performance gain is for unsafe_getindex; the difference isn't quite as big for getindex (only ~10%) since there's still a GC root for the BoundsError. That can also be avoided, but I'd rather make that change more systematically (as `checkbounds`) with #10525 or a subset thereof.

carlobaldassi · 2015-05-22T18:17:42Z

Wow. When I first wrote those functions, I remember I checked that extracting the chunks array before entering the loops was actually better. This is good news.

carlobaldassi · 2015-05-22T18:21:43Z

Does this fill the gap for the in function in your IntSet refactoring (#10065)?

mbauman · 2015-05-22T18:29:55Z

I haven't tested it yet. I imagine it'll help, but I don't think it will get the whole way there because it still has the same issue that this PR is working around (#9974), just one level up. So there's still a GC frame in the way.

I might try making BitArrays or IntSets immutable with Ref fields…

(Thanks for the review, I'll make those changes. 👍).

simonster · 2015-05-22T19:17:09Z

Maybe I'm not thinking this through properly, but shouldn't the overhead for the GC root be constant WRT the number of elements in the array? Whereas if we don't extract chunks first, then I'd think we need to do pointer loads for each element in the array if memory is modified inside a loop (#9755).

mbauman · 2015-05-22T20:07:54Z

That's a very good point, but that doesn't seem to be the case. I'm not sure why it's different from the Sparse idiom, and I don't read LLVM well enough to see where the load is happening. See, e.g., @code_llvm Base.unsafe_getindex(falses(10), [1,2,3]).

I was measuring this with a pared down version of the array indexing perf suite from #10525:

Before:

julia,sumeltIs BitArray,2.289962,4.516089,2.510938,0.475899
julia,sumeachIs BitArray,2.392486,4.950358,2.636985,0.518681
julia,sumlinearIs BitArray,2.392490,4.723534,2.651031,0.516554
julia,sumrangeIs BitArray,55.735734,68.236330,61.076764,3.231782
julia,sumlogicalIs BitArray,76.457638,89.633972,82.705620,3.218650
julia,sumvectorIs BitArray,52.035703,64.384697,57.298191,3.329376
julia,sumeltIb BitArray,20.508571,29.376177,23.304242,2.412258
julia,sumeachIb BitArray,20.551273,28.811882,24.374394,2.373606
julia,sumlinearIb BitArray,20.658381,28.198992,24.241191,1.996451
julia,sumrangeIb BitArray,7.435534,13.135764,8.971654,1.352756
julia,sumlogicalIb BitArray,44.855952,59.534208,51.114627,3.604198
julia,sumvectorIb BitArray,4.995062,9.224450,5.736706,0.989800

After:

julia,sumeltIs BitArray,2.119090,4.205546,2.326048,0.446215
julia,sumeachIs BitArray,1.798431,3.784359,1.988818,0.391689
julia,sumlinearIs BitArray,1.798433,3.726707,2.003796,0.412508
julia,sumrangeIs BitArray,54.409144,66.483328,59.816160,3.582779
julia,sumlogicalIs BitArray,73.263354,88.265986,81.087132,3.389352
julia,sumvectorIs BitArray,51.502689,61.310434,56.091181,2.488938
julia,sumeltIb BitArray,20.508759,26.162578,22.219579,1.974766
julia,sumeachIb BitArray,18.990501,25.145888,21.125213,2.032431
julia,sumlinearIb BitArray,18.989782,27.432469,21.050434,2.136499
julia,sumrangeIb BitArray,7.405507,13.296387,8.770047,1.395887
julia,sumlogicalIb BitArray,48.416468,58.620361,52.583693,2.959972
julia,sumvectorIb BitArray,4.891553,9.179898,5.628998,0.960776

mbauman · 2015-05-22T20:28:50Z

~~Ah, with very large arrays (3000x5000) I do see a slight (~10%) regression in logical indexing. In the tests above the array sizes are 3x5 (Is) and 300x500 (Ib).~~ Edit: I think this was a fluke.

I know nothing about this, but could this be TBAA doing its job (I'm on LLVM 3.3)? That would explain why the performance here has changed since @carlobaldassi did his tests.

simonster · 2015-05-22T20:33:16Z

A simple sum doesn't have stores in the inner loop, so LLVM ought to hoist the loads there regardless. I wonder how much things change if you sum into an element of an array rather than a variable.

mbauman · 2015-05-22T20:36:37Z

Sure, but what about the inner loops of the getindex function itself? For the non-scalar indexing cases, it's loading into a second BitArray.

simonster · 2015-05-23T14:36:58Z

I benchmarked this:

function f(x::BitVector, y::Vector{Int})
    chunks = x.chunks
    for i = 1:length(x)
        @inbounds y[i] += Base.unsafe_bitgetindex(chunks, i)
    end
end

a = falses(1000);
b = [1];
for i = 1:1; f(a, b); end
@time for i = 1:10000; f(a, b); end
a = falses(10000);
b = [1];
@time for i = 1:10000; f(a, b); end
a = falses(100000);
b = [1];
@time for i = 1:10000; f(a, b); end
end

On eb5da26 I get:

   1.133 seconds     
   1.142 seconds     
   1.141 seconds

and with this PR (where I obviously don't extract chunks in f):

   1.414 seconds     
   1.401 seconds     
   1.452 seconds

With this PR, there is no GC root, but there are two extra loads in the inner loop, since the chunks field and array pointer need to be reloaded on each iteration.

If I accumulate into a variable instead of the array, then I don't actually see a substantial difference for this benchmark. The inner loop IR is the same. I'd expect a performance advantage for this PR for very small cases, but it doesn't seem to be larger than the variability in my benchmarking for 1000 elements.

mbauman · 2015-05-23T14:44:04Z

Thanks so much for doing those tests, Simon. This clearly isn't the answer.

jebej · 2019-08-23T18:35:42Z

@mbauman I was wondering why BitArrays were mutable and found this PR. Did you end up experimenting with making them immutable?

It seems that only the len field would need to be a reference for everything to work.

mbauman · 2019-08-23T18:42:32Z

I did a bit more experimentation in #11430, but I've not looked at it since. As I recall, the major performance hit was because using a Ref makes that field a separate object that is separately allocated and potentially far away in memory. This cost is paid multiple times — once at construction, but then by the way of cache misses every time you access it.

mbauman closed this May 23, 2015

mbauman mentioned this pull request May 24, 2015

WIP: Allow constant fields in mutable types #11430

Closed

6 tasks

mbauman deleted the mb/fasterbits branch June 8, 2015 20:21

This was referenced Jul 29, 2015

Improve prime sieve performance #12025

Merged

Introduce Buffer type and make Array an abstraction on top of it #12447

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up scalar BitArray indexing by ~25% #11403

Speed up scalar BitArray indexing by ~25% #11403

mbauman commented May 22, 2015

carlobaldassi commented May 22, 2015

carlobaldassi commented May 22, 2015

mbauman commented May 22, 2015

simonster commented May 22, 2015

mbauman commented May 22, 2015

mbauman commented May 22, 2015

simonster commented May 22, 2015

mbauman commented May 22, 2015

simonster commented May 23, 2015

mbauman commented May 23, 2015

jebej commented Aug 23, 2019

mbauman commented Aug 23, 2019

Speed up scalar BitArray indexing by ~25% #11403

Speed up scalar BitArray indexing by ~25% #11403

Conversation

mbauman commented May 22, 2015

carlobaldassi commented May 22, 2015

carlobaldassi commented May 22, 2015

mbauman commented May 22, 2015

simonster commented May 22, 2015

mbauman commented May 22, 2015

mbauman commented May 22, 2015

simonster commented May 22, 2015

mbauman commented May 22, 2015

simonster commented May 23, 2015

mbauman commented May 23, 2015

jebej commented Aug 23, 2019

mbauman commented Aug 23, 2019