groupsort for bcolz #76
Comments
Note: I see carrays are still in memory |
See also http://www.slideshare.net/cvaartjes/bcolz-groupby-discussion-document
So we have factorization in place. If we look at page 4 of the doc, we first apply factorization to all groupby columns.
Then it calculates the theoretical number of possible groups, ngroups (see page 7 of the document), and after that the group_index (page 8 of the document).
The group_index here is what I called groupby_input in slide 8. NB: there is some numpy-specific stuff here that we need to translate to carrays. At this point we actually already have what we need to build the groupby columns (as we have the unique set of combinations of all input columns).
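The group_index calculation described above can be sketched in NumPy like this (a minimal sketch with a hypothetical helper name; `labels` are the factorized codes per groupby column, `shape` the per-column unique counts):

```python
import numpy as np

def group_index(labels, shape):
    # Combine per-column factorized labels into one group index.
    # The combined space has prod(shape) theoretically possible groups (ngroups).
    index = labels[0].astype(np.int64)
    for lab, size in zip(labels[1:], shape[1:]):
        index = index * size + lab  # mixed-radix encoding of the combination
    return index

# two columns with 3 and 2 unique values -> ngroups = 3 * 2 = 6
a = np.array([0, 1, 2, 1])
b = np.array([0, 1, 0, 1])
print(group_index([a, b], [3, 2]))  # [0 3 4 3]
```

Each distinct combination of input codes maps to a distinct integer, which is what makes the later counted sort possible.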
The groupsort_indexer first checks whether to do a counted sort or a merge sort (when the number of theoretically possible values is much larger than the actual size, do a merge sort; otherwise a counted sort). Alternatively, you could do another factorization of the actual values here and then count-sort that (as that always caps the number of values at the length of the carray, even if every value is unique) -> I actually do that on page 8 of my discussion doc. Pandas also compresses the index, but in another place I think; see _compress_group_index() in its code if you're really interested ;)
(I think that for now we can stick to counted sorts at least.) The groupsort_indexer is the counted sort that Francesc E. implemented above, so it gives back the sorted result as on page 10 of my doc. So, a bit more to do, but with factorize and counted sorting in place, what remains is: first the index multiplication (in case of >1 groupby columns), then building the ctable with the groupby columns, and then doing the actual agg calcs using the index from the counted sort. If I'm not mistaken/confused/silly, that is ;) |
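For reference, the counted sort can be sketched in plain Python/NumPy like this (a simplified sketch of the idea, assuming all labels are already in [0, ngroups); the real implementation is Cython):

```python
import numpy as np

def groupsort_indexer(labels, ngroups):
    # 1) count the size of each group
    counts = np.zeros(ngroups, dtype=np.int64)
    for lab in labels:
        counts[lab] += 1
    # 2) cumulative offsets: where each group's slice starts
    where = np.zeros(ngroups, dtype=np.int64)
    where[1:] = np.cumsum(counts)[:-1]
    # 3) place every row index into its group's slice (stable order within groups)
    indexer = np.zeros(len(labels), dtype=np.int64)
    for i, lab in enumerate(labels):
        indexer[where[lab]] = i
        where[lab] += 1
    return indexer, counts

idx, cnt = groupsort_indexer(np.array([1, 0, 1, 2, 0]), 3)
print(idx, cnt)  # [1 4 0 2 3] [2 2 1]
```

This runs in O(n + ngroups) and is stable, which is why it wins when ngroups is small relative to the data; a merge sort avoids the ngroups-sized allocations when the theoretical group space is huge.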
@FrancescElies the groupsort_indexer output looks good (haven't looked at the code). One possible optimization may be to count the group sizes during factorize. That way you need to loop over the data just once to obtain a) the groups and b) their sizes (maybe you are doing this already?). @CarstVaartjes forgive me, I'll need some time to digest your post, but my initial hunch is that we are on the right track. |
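Counting during factorize could look roughly like this (a hypothetical pure-Python sketch of the idea, not the actual Cython code): one pass assigns codes and tallies sizes at the same time.

```python
import numpy as np

def factorize_with_counts(values):
    table = {}   # value -> integer code
    counts = []  # counts[code] = occurrences seen so far
    labels = np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        if v in table:
            code = table[v]
            counts[code] += 1
        else:
            code = table[v] = len(table)  # first sighting gets the next code
            counts.append(1)
        labels[i] = code
    return labels, table, counts

labels, table, counts = factorize_with_counts(['NY', 'CA', 'NY', 'IL'])
print(list(labels), counts)  # [0, 1, 0, 2] [2, 1, 1]
```

The later counted sort can then skip its own counting pass and go straight to computing offsets.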
@FrancescElies one quick note about: https://github.com/FrancescElies/bcolz/blob/groupsort/bcolz/carray_ext.pyx#L2731 I believe you could use |
@esc I will, indeed much more elegant, thanks! |
@esc it was a long post, the important information was kind of hidden; sometimes it's hard to summarize.

```python
In [14]: print result, counts
[ 0  8  9 13 16 19  1  2  7 10 12 17 18  3  4  5 11 14  6 15] [0 6 7 5 2]
```

As you see, the second array contains the lengths of the groups: 6 times NY, 7 times CA, 5 times IL, 2 times OH. This indicates that the first 6 numbers in the first array are the positions where you find NY, the next 7 numbers are the positions for CA, and so on. |
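With those counts in hand, each group's row positions fall out by slicing the first array. A small sketch using the numbers above (the meaning of the leading 0 as a slot reserved for -1/missing labels follows the pandas convention — an assumption here, since no label is missing in this data):

```python
import numpy as np

result = np.array([0, 8, 9, 13, 16, 19, 1, 2, 7, 10, 12, 17, 18,
                   3, 4, 5, 11, 14, 6, 15])
counts = np.array([0, 6, 7, 5, 2])  # leading 0: slot for missing labels
names = ['<missing>', 'NY', 'CA', 'IL', 'OH']

start = 0
for name, n in zip(names, counts):
    if n:  # skip empty groups
        print(name, result[start:start + n].tolist())
        # NY -> rows 0, 8, 9, 13, 16, 19; CA -> 1, 2, 7, ...; etc.
    start += n
```

An aggregation then only needs to walk these contiguous slices, one group at a time.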
@FrancescElies ah yes! I did see that. But it is a second step. However, we must consider how we want to count. In your approach you know how many groups there are, so you can use simple indexing into an array. If, on the other hand, we try to count during factorize, we don't yet know how many groups there are, and so we need something that can grow, e.g. a dictionary. An alternative might be to use a hashtable-based structure. Trade-offs everywhere! Sorry if I am being confusing, it is so late again. :yawn: |
A hashtable-based unique is almost the same as a factorize (only you do not save the factorized carray with index values, just the dict/list of unique values), so I think you would still be doing double the work. |
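The point above as a sketch: a hashtable-based unique is essentially the factorize loop minus the label output, so running it before a separate factorize repeats the same hash probing.

```python
def unique_values(values):
    # same hash probing as factorize, but only the value table survives
    table = {}
    for v in values:
        if v not in table:
            table[v] = len(table)
    return table

print(unique_values(['NY', 'CA', 'NY', 'IL']))  # {'NY': 0, 'CA': 1, 'IL': 2}
```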
Yes, doing unique and then factorize + count is double work, just like doing factorize and then count. But what happens inside the loops, and the data structures one can use, are slightly different, so the two might have different performance characteristics. Anyway, no use speculating w/o benchmarks. :) Regarding the preallocation: I think you may even get away with just allocating a zero-length array and appending to it. But of course, if you just use a small portion of the array and truncate the rest, it will be (or should be) efficient too. |
@esc sorry for the misunderstanding & thanks for the ideas. |
Today, while exploring some alternatives, I ended up with the following results:

```python
In [1]: import numpy, random, bcolz, tempfile, os, shutil

In [2]: N = int(1e5)

In [3]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(N)), dtype='S2')

In [4]: rootdir = tempfile.mkdtemp(prefix='bcolz-factorize')

In [5]: shutil.rmtree(rootdir)

In [6]: c = bcolz.carray(a, rootdir=rootdir)

In [7]: bcolz.carray_ext.factorize_cython(c)
Out[7]:
(carray((100000,), uint64)
   nbytes: 781.25 KB; cbytes: 320.09 KB; ratio: 2.44
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 1 2 ..., 2 1 3], {0: 'OH', 1: 'NY', 2: 'IL', 3: 'CA'})

In [8]: %timeit bcolz.carray_ext.factorize_cython(c)
100 loops, best of 3: 10.6 ms per loop

In [9]: %timeit bcolz.carray_ext.factorize_with_counts_cython(c)
100 loops, best of 3: 13.1 ms per loop

In [10]: %timeit bcolz.carray_ext.groupsort_indexer_cython(*bcolz.carray_ext.factorize_cython(c))
1 loops, best of 3: 2.35 s per loop

In [11]: %timeit bcolz.carray_ext.groupsort_indexer_without_counts_cython(*bcolz.carray_ext.factorize_with_counts_cython(c))
100 loops, best of 3: 17.2 ms per loop

In [12]: shutil.rmtree(rootdir)
```

Factorizing with the counting inside seems to be only slightly slower than without; relooping while grouping makes a big difference. I still don't know if this is because of looping again (like @esc pointed out) or something else. In any case, these are the benchmarks I could get. I also tried to make it work with carrays, but the only results I could get were really slow (for sure because of me not using them correctly). |
Well, the problem was actually me looping over labels inside the groupsort function incorrectly: I should have looped over the chunks and then the leftovers. Inside groupsort one can still find other ndarrays that could be converted to carrays, but for the sake of readability I did not convert them, because of the split loop for chunks and leftovers that @esc already mentioned before (open issue #73).

```python
In [7]: bcolz.carray_ext.factorize_cython(c)
Out[7]:
(carray((100000,), uint64)
   nbytes: 781.25 KB; cbytes: 320.04 KB; ratio: 2.44
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 0 1 ..., 3 0 1], {0: 'CA', 1: 'IL', 2: 'OH', 3: 'NY'})

In [8]: bcolz.carray_ext.factorize_with_counts_cython(c)
Out[8]:
(carray((100000,), uint64)
   nbytes: 781.25 KB; cbytes: 320.04 KB; ratio: 2.44
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 0 1 ..., 3 0 1],
 {0: 'CA', 1: 'IL', 2: 'OH', 3: 'NY'},
 [0, 24965, 24937, 24842, 25256])

In [9]: bcolz.carray_ext.groupsort_indexer_cython(*bcolz.carray_ext.factorize_cython(c))
Out[9]:
(array([    0,     1,    20, ..., 99990, 99993, 99997], dtype=uint64),
 array([    0, 24965, 24937, 24842, 25256], dtype=uint64))

In [10]: bcolz.carray_ext.groupsort_indexer_without_counts_cython(*bcolz.carray_ext.factorize_with_counts_cython(c))
Out[10]:
(array([    0,     1,    20, ..., 99990, 99993, 99997]),
 [0, 24965, 24937, 24842, 25256])

In [11]: %timeit bcolz.carray_ext.factorize_cython(c)
10 loops, best of 3: 18 ms per loop

In [12]: %timeit bcolz.carray_ext.factorize_with_counts_cython(c)
100 loops, best of 3: 14.9 ms per loop

In [13]: %timeit bcolz.carray_ext.groupsort_indexer_cython(*bcolz.carray_ext.factorize_cython(c))
10 loops, best of 3: 27.2 ms per loop

In [14]: %timeit bcolz.carray_ext.groupsort_indexer_without_counts_cython(*bcolz.carray_ext.factorize_with_counts_cython(c))
100 loops, best of 3: 19.4 ms per loop
```

At least this time both results are in the same order of magnitude. |
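The chunks-then-leftovers looping mentioned above can be sketched with plain NumPy (a hypothetical helper for illustration; the real code walks the carray's compressed chunks and its leftover array in Cython):

```python
import numpy as np

def iter_blocks(arr, chunklen):
    # full chunks first...
    nchunks = len(arr) // chunklen
    for i in range(nchunks):
        yield arr[i * chunklen:(i + 1) * chunklen]
    # ...then the leftover tail, which is shorter than chunklen
    leftover = arr[nchunks * chunklen:]
    if len(leftover):
        yield leftover

blocks = list(iter_blocks(np.arange(10), 4))
print([len(b) for b in blocks])  # [4, 4, 2]
```

Looping per block keeps decompression sequential and avoids the element-at-a-time access that made the earlier version so slow.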
If N gets bigger:

```python
In [27]: %timeit bcolz.carray_ext.groupsort_indexer_cython(*bcolz.carray_ext.factorize_cython(c))
1 loops, best of 3: 3.08 s per loop

In [28]: %timeit bcolz.carray_ext.groupsort_indexer_without_counts_cython(*bcolz.carray_ext.factorize_with_counts_cython(c))
1 loops, best of 3: 1.87 s per loop
```
|
Hi @esc! You can close this one, as it's been decided to move to a separate package (which will plug into the new .pxd file). |
Okidokey. |
This is related to #62 and #63.
I made the cythonized groupsort_indexer from pandas work with carrays from bcolz; see groupsort_indexer_cython in the link below. It seems to work, but to do it right I think I need some help: I am not completely sure whether I did the right thing from a performance perspective, but at least the output I could examine so far looks good.
You can find the code here:
https://github.com/FrancescElies/bcolz/blob/groupsort/bcolz/carray_ext.pyx#L2721-2721
The %cpaste'd function is used for comparison purposes (taken from pandas).