
groupsort for bcolz #76

Closed

FrancescElies opened this issue Oct 16, 2014 · 15 comments

Comments

@FrancescElies
Member

This is related to #62 and #63.

I made the cythonized groupsort_indexer version from pandas work with carrays from bcolz; see groupsort_indexer_cython in the link below.

It seems to work, but to do it right I think I need some help: I am not completely sure whether I did the right thing from a performance perspective. At least the output I could examine so far looks good.

You can find the code here:
https://github.com/FrancescElies/bcolz/blob/groupsort/bcolz/carray_ext.pyx#L2721-2721

The cpasted function is used for comparison purposes (taken from pandas):

In [1]: import numpy as np

In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:def groupsort_indexer(labels, reverse):
:    ngroups = len(reverse)
:    # count group sizes, location 0 for NA
:    counts = np.zeros(ngroups + 1, dtype=np.int64)
:    n = len(labels)
:    for i in range(n):
:        counts[labels[i] + 1] += 1
:
:    # mark the start of each contiguous group of like-indexed data
:    where = np.zeros(ngroups + 1, dtype=np.int64)
:    for i in range(1, ngroups + 1):
:        where[i] = where[i - 1] + counts[i - 1]
:
:    # this is our indexer
:    result = np.zeros(n, dtype=np.int64)
:    for i in range(n):
:        label = labels[i] + 1
:        result[where[label]] = i
:        where[label] += 1
:    return result, counts
:--

In [3]: import numpy, random, bcolz, tempfile, shutil

In [4]: random.seed(1)

In [5]: N = int(20)

In [6]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(N)), dtype='S2')

In [7]: rootdir = tempfile.mkdtemp(prefix='bcolz-factorize')

In [8]: shutil.rmtree(rootdir)

In [9]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'), rootdir=rootdir)

In [10]: print c
['NY' 'CA' 'CA' 'IL' 'IL' 'IL' 'OH' 'CA' 'NY' 'NY' 'CA' 'IL' 'CA' 'NY' 'IL'
 'OH' 'NY' 'CA' 'CA' 'NY']

In [11]: labels, reverse = bcolz.carray_ext.factorize_cython(c)

In [12]: print labels, reverse
[0 1 1 2 2 2 3 1 0 0 1 2 1 0 2 3 0 1 1 0] {0: 'NY', 1: 'CA', 2: 'IL', 3: 'OH'}

In [13]: result, counts = bcolz.carray_ext.groupsort_indexer_cython(labels, reverse)

In [14]: print result, counts
[ 0  8  9 13 16 19  1  2  7 10 12 17 18  3  4  5 11 14  6 15] [0 6 7 5 2]

In [15]: result_ref, counts_ref = groupsort_indexer(labels, reverse)

In [16]: print result_ref, counts_ref
[ 0  8  9 13 16 19  1  2  7 10 12 17 18  3  4  5 11 14  6 15] [0 6 7 5 2]

In [17]: shutil.rmtree(rootdir)
@FrancescElies
Member Author

Note: I see the carrays are still in memory.

@CarstVaartjes
Contributor

See also http://www.slideshare.net/cvaartjes/bcolz-groupby-discussion-document

So we have factorizing in place. If we look at page 4 of the doc, we first apply the factorization to all groupby columns.
Then we arrive at where we are now: combine the columns together, factorize that, and sort them. (NB: in pandas, if there is just one groupby column, it takes a different path where it sorts the label_list directly, but personally I would prefer to keep everything in one path.) Anyway, the pandas routine to combine them (label_list and keys are the lists containing, in order, the factorization results of the groupby columns) ->

def _get_indices_dict(label_list, keys):
    shape = list(map(len, keys))
    ngroups = np.prod(shape)

    group_index = get_group_index(label_list, shape)
    sorter = _get_group_index_sorter(group_index, ngroups)

    sorted_labels = [lab.take(sorter) for lab in label_list]
    group_index = group_index.take(sorter)

    return lib.indices_fast(sorter, group_index, keys, sorted_labels)

So it calculates the theoretical number of possible groups, ngroups (see page 7 of the document):

    shape = list(map(len, keys))
    ngroups = np.prod(shape)

and then it calculates the group_index (page 8 of the document), with this code:

group_index = get_group_index(label_list, shape) -> 

def get_group_index(label_list, shape):
    """
    For the particular label_list, gets the offsets into the hypothetical list
    representing the totally ordered cartesian product of all possible label
    combinations.
    """
    if len(label_list) == 1:
        return label_list[0]

    n = len(label_list[0])
    group_index = np.zeros(n, dtype=np.int64)
    mask = np.zeros(n, dtype=bool)
    for i in range(len(shape)):
        stride = np.prod([x for x in shape[i + 1:]], dtype=np.int64)
        group_index += com._ensure_int64(label_list[i]) * stride
        mask |= label_list[i] < 0

    np.putmask(group_index, mask, -1)
    return group_index

The group_index here is what I called groupby_input on slide 8. NB: there is some numpy-specific stuff here that we need to transform to carrays. At this point we actually already have the result needed to make the groupby columns (as we have the unique set of combinations of all input columns).
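To make the stride arithmetic concrete, here is a minimal sketch with toy numbers of my own (not from the pandas source): two label columns with 2 and 3 unique values, so shape = [2, 3] and ngroups = 6.

import numpy as np

# toy factorized labels for two groupby columns
label_list = [np.array([0, 1, 0, 1]), np.array([0, 1, 2, 0])]
shape = [2, 3]  # number of uniques per column; ngroups = 2 * 3 = 6

group_index = np.zeros(4, dtype=np.int64)
for i in range(len(shape)):
    # stride of column i in the cartesian product: 3 for i=0, 1 for i=1
    stride = np.prod(shape[i + 1:], dtype=np.int64)
    group_index += label_list[i] * stride

print(group_index)  # [0 4 2 3] -> rows map to combos (0,0) (1,1) (0,2) (1,0)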
Now Pandas creates something it calls a sorter:

sorter = _get_group_index_sorter(group_index, ngroups)

_get_group_index_sorter checks whether to do a counted sort or a merge sort (when the theoretical number of values is a lot bigger than the actual size, do a mergesort; otherwise a counted sort). Alternatively, you could also do another factorization here of all the actual values and then count-sort that (as that will deliver at most as many values as the length of the carray, even if every value is unique); I actually do that on page 8 of my discussion doc. Pandas also compresses the index, but in another place I think; see "_compress_group_index()" in the code there if you're really interested ;)
Anyway, back to sorting:

def _get_group_index_sorter(group_index, ngroups):
    """
    _algos.groupsort_indexer implements `counting sort` and it is at least
    O(ngroups), where
        ngroups = prod(shape)
        shape = map(len, keys)
    that is, linear in the number of combinations (cartesian product) of unique
    values of groupby keys. This can be huge when doing multi-key groupby.
    np.argsort(kind='mergesort') is O(count x log(count)) where count is the
    length of the data-frame;
    Both algorithms are `stable` sort and that is necessary for correctness of
    groupby operations. e.g. consider:
        df.groupby(key)[col].transform('first')
    """
    count = len(group_index)
    alpha = 0.0  # taking complexities literally; there may be
    beta  = 1.0  # some room for fine-tuning these parameters
    if alpha + beta * ngroups < count * np.log(count):
        sorter, _ = _algos.groupsort_indexer(com._ensure_int64(group_index),
                                             ngroups)
        return com._ensure_platform_int(sorter)
    else:
        return group_index.argsort(kind='mergesort')

(I think that for now we can stick to counted sorts at least.) The groupsort_indexer is the counted sort that Francesc E. implemented above, so it gives back the sorted result as on page 10 of my doc.
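Just to illustrate the heuristic with toy numbers of my own (assuming alpha=0 and beta=1 as in the snippet above): the counted sort wins whenever ngroups is small relative to count * log(count).

import numpy as np

def prefer_counted_sort(count, ngroups, alpha=0.0, beta=1.0):
    # same inequality as in _get_group_index_sorter above
    return alpha + beta * ngroups < count * np.log(count)

print(prefer_counted_sort(int(1e6), 4))         # True  -> counted sort
print(prefer_counted_sort(int(1e6), int(1e9)))  # False -> mergesort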

So, a bit more to do, but with factorize and counted sorting in place, what remains is: first the index multiplication (in case of >1 groupby columns), then making the ctable with the groupby columns, and then doing the actual agg calcs using the index from the counted sort.

If i'm not mistaken/confused/silly that is ;)

@esc
Member

esc commented Oct 16, 2014

@FrancescElies the groupsort_indexer output looks good (haven't looked at the code). One possible optimization may be to count the group sizes during factorize. That way you need to loop over the data just once to obtain a) the groups and b) their sizes (maybe you are doing this already?).
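For example, a pure-Python sketch of what I mean (the real thing would of course be cythonized and chunk-aware; the names are made up):

import numpy as np

def factorize_with_counts(values):
    table = {}    # value -> code
    reverse = {}  # code -> value
    counts = []   # counts[code] = group size
    labels = np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        code = table.get(v, -1)
        if code == -1:
            # first time we see this value: assign the next code
            code = len(table)
            table[v] = code
            reverse[code] = v
            counts.append(0)
        counts[code] += 1
        labels[i] = code
    return labels, reverse, counts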

@CarstVaartjes forgive me, but I'll need some time to digest your post, but my initial hunch is that we are on the right track.

@esc
Member

esc commented Oct 16, 2014

@FrancescElies one quick note about:

https://github.com/FrancescElies/bcolz/blob/groupsort/bcolz/carray_ext.pyx#L2731

I believe you could use bcolz.zeros here.

@FrancescElies
Member Author

@esc I will, indeed much more elegant, thanks!

@FrancescElies
Member Author

@esc it was a long post and the important information was kind of hidden; sometimes it's hard to summarize.

In [14]: print result, counts
[ 0  8  9 13 16 19  1  2  7 10 12 17 18  3  4  5 11 14  6 15] [0 6 7 5 2]

As you see, the second array contains the lengths of the groups: 6 times NY, 7 times CA, 5 times IL, 2 times OH. This indicates that the first 6 numbers in the first array are the positions where you find NY, the next 7 numbers are the positions for CA, and so on.
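In other words, the cumulative sums of counts give the group boundaries inside result (a small check with the numbers above; the leading 0 is the NA slot):

import numpy as np

counts = np.array([0, 6, 7, 5, 2])
result = np.array([0, 8, 9, 13, 16, 19, 1, 2, 7, 10, 12, 17, 18,
                   3, 4, 5, 11, 14, 6, 15])
bounds = np.cumsum(counts)  # [ 0  6 13 18 20]
print(result[bounds[0]:bounds[1]])  # positions of 'NY': [ 0  8  9 13 16 19]
print(result[bounds[1]:bounds[2]])  # positions of 'CA': [ 1  2  7 10 12 17 18]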

@esc
Member

esc commented Oct 16, 2014

@FrancescElies ah yes! I did see that. But it is a second step, i.e. part of groupsort_indexer_cython. My suggestion was to do this directly as part of factorize. So for example factorize would return labels, reverse, counts. It might be faster this way, since you only loop over the data once before sorting.

However, we must consider how we want to count. In your approach you know how many groups there are, so you can use simple indexing into an array. If, on the other hand, we try to count during factorize, we don't yet know how many groups there are, and so we need something that can grow, e.g. a dictionary.

An alternative might be to use a hashtable-based unique first, to determine the groups and how many there are, and in a second step do both the factorization and the counting. The advantage of doing it that way is that you know the number of groups ahead of time, i.e. you can choose the correct integer size for the labels and you can allocate a fixed-size array (numpy or carray, depending on the size) to hold the counts.
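Something like this rough sketch (pure Python, just to illustrate the two passes; the names are mine):

import numpy as np

def factorize_two_pass(values):
    # pass 1: hashtable-based unique
    table = {}
    uniques = []
    for v in values:
        if v not in table:
            table[v] = len(uniques)
            uniques.append(v)
    # the number of groups is now known: pick the label dtype and
    # preallocate a fixed-size counts array
    labels = np.empty(len(values), dtype=np.min_scalar_type(len(uniques)))
    counts = np.zeros(len(uniques), dtype=np.int64)
    # pass 2: factorize and count in one go
    for i, v in enumerate(values):
        code = table[v]
        labels[i] = code
        counts[code] += 1
    return labels, dict(enumerate(uniques)), counts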

Trade-offs everywhere! Sorry if I am being confusing, it is so late again. :yawn:

@CarstVaartjes
Contributor

A hashtable-based unique is almost the same as a factorize (only you do not save the factorized carray with index values, just the dict/list with unique values), so I think you would still be doing double the work.
What might be nice is to just make another carray as long as the input carray (which will compress really well) and resize it at the end to the actual size. So you go from the worst case to the actual case. The nice thing is that with an out-of-core solution with compression, this should not be an incredibly bad thing, I think.
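As a sketch with toy sizes (assuming carray.resize can shrink the array; a long run of zeros compresses extremely well):

import numpy as np
import bcolz

n = int(1e5)
# worst case: as many groups as rows
counts = bcolz.zeros(n, dtype=np.int64)

# ... fill counts[0:ngroups] while factorizing ...
ngroups = 4

counts.resize(ngroups)  # shrink to the actual number of groups
print(len(counts))      # 4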

@esc
Member

esc commented Oct 16, 2014

Yes, doing unique and then factorize + count is double work, like doing factorize and then count. But what happens inside the loops and the data structures one can use differ slightly, so the two might have different performance characteristics. Anyway, no use speculating w/o benchmarks. :)

Regarding the preallocation, I think you may even get away with just allocating a zero-length array and appending to it. But of course, if you just use a small portion of the array and truncate the rest, it will be (or should be) efficient too.
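e.g. a quick sketch of the append variant (toy counts of my own):

import numpy as np
import bcolz

counts = bcolz.carray(np.empty(0, dtype=np.int64))  # start zero-length
for c in (6, 7, 5, 2):  # toy per-group counts
    counts.append(np.array([c], dtype=np.int64))
print(counts[:])  # [6 7 5 2]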

@FrancescElies
Member Author

@esc sorry for the misunderstanding & thanks for the ideas.
I'll try to explore them tomorrow (hopefully with a fresh mind)

@FrancescElies
Member Author

Today, while exploring some alternatives, I ended up with the following results:

In [1]: import numpy, random, bcolz, tempfile, os, shutil

In [2]: N = int(1e5)

In [3]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(N)), dtype='S2')

In [4]: rootdir = tempfile.mkdtemp(prefix='bcolz-factorize')

In [5]: shutil.rmtree(rootdir)

In [6]: c = bcolz.carray(a, rootdir=rootdir)

In [7]: bcolz.carray_ext.factorize_cython(c)
Out[7]:
(carray((100000,), uint64)
   nbytes: 781.25 KB; cbytes: 320.09 KB; ratio: 2.44
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 1 2 ..., 2 1 3], {0: 'OH', 1: 'NY', 2: 'IL', 3: 'CA'})

In [8]: %timeit bcolz.carray_ext.factorize_cython(c)
100 loops, best of 3: 10.6 ms per loop

In [9]: %timeit bcolz.carray_ext.factorize_with_counts_cython(c)
100 loops, best of 3: 13.1 ms per loop

In [10]: %timeit bcolz.carray_ext.groupsort_indexer_cython(*bcolz.carray_ext.factorize_cython(c))
1 loops, best of 3: 2.35 s per loop

In [11]: %timeit bcolz.carray_ext.groupsort_indexer_without_counts_cython(*bcolz.carray_ext.factorize_with_counts_cython(c))
100 loops, best of 3: 17.2 ms per loop

In [12]: shutil.rmtree(rootdir)

Factorizing with counting inside seems to be only slightly slower than factorizing without it, but relooping while grouping seems to make a big difference. I still don't know if this is because of looping over the data again (like @esc pointed out) or because groupsort_indexer_cython still contains some badly performing operations (some of which @CarstVaartjes already corrected).

In any case, these are the benchmarks I could get. The code can be found at the end of the following file:
https://github.com/FrancescElies/bcolz/blob/groupsort/bcolz/carray_ext.pyx#L2628-2628

I also tried to make it work with carrays, but the only results I could get were really slow (for sure because I was not using them correctly).

@FrancescElies
Member Author

Well, the problem was actually me looping over labels inside the groupsort function incorrectly; I should have looped over the chunks and then the leftovers.

Inside groupsort one can find other ndarrays that could be converted to carrays, but for the sake of readability I did not convert them, because of the split loop over chunks and leftovers that @esc already mentioned before (open issue #73).
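For reference, the loop pattern I mean, as a plain-Python sketch using only slicing (the cython version works on the decompressed chunk buffers instead; the chunklen here is picked by hand just for the illustration):

import numpy as np
import bcolz

c = bcolz.carray(np.arange(10007), chunklen=1024)
total = 0
nchunks = len(c) // c.chunklen
for i in range(nchunks):
    # process one full chunk at a time
    block = c[i * c.chunklen:(i + 1) * c.chunklen]
    total += block.sum()
# then the leftover elements that do not fill a whole chunk
leftover = c[nchunks * c.chunklen:]
total += leftover.sum()
print(total == np.arange(10007).sum())  # True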

At least this time both results are the same order of magnitude.

Note: factorize_with_counts_cython still has a small bug somewhere; NY is missing from the reverse dict (Out[8]).

In [7]: bcolz.carray_ext.factorize_cython(c)
Out[7]:
(carray((100000,), uint64)
   nbytes: 781.25 KB; cbytes: 320.04 KB; ratio: 2.44
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 0 1 ..., 3 0 1], {0: 'CA', 1: 'IL', 2: 'OH', 3: 'NY'})

In [8]: bcolz.carray_ext.factorize_with_counts_cython(c)
Out[8]:
(carray((100000,), uint64)
   nbytes: 781.25 KB; cbytes: 320.04 KB; ratio: 2.44
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 0 1 ..., 3 0 1],
 {0: 'CA', 1: 'IL', 2: 'OH', 3: ''},
 [0, 24965, 24937, 24842, 25256])

In [9]: bcolz.carray_ext.groupsort_indexer_cython(*bcolz.carray_ext.factorize_cython(c))
Out[9]:
(array([    0,     1,    20, ..., 99990, 99993, 99997], dtype=uint64),
 array([    0, 24965, 24937, 24842, 25256], dtype=uint64))

In [10]: bcolz.carray_ext.groupsort_indexer_without_counts_cython(*bcolz.carray_ext.factorize_with_counts_cython(c))
Out[10]:
(array([    0,     1,    20, ..., 99990, 99993, 99997]),
 [0, 24965, 24937, 24842, 25256])

In [11]: %timeit bcolz.carray_ext.factorize_cython(c)
10 loops, best of 3: 18 ms per loop

In [12]: %timeit bcolz.carray_ext.factorize_with_counts_cython(c)
100 loops, best of 3: 14.9 ms per loop

In [13]: %timeit bcolz.carray_ext.groupsort_indexer_cython(*bcolz.carray_ext.factorize_cython(c))
10 loops, best of 3: 27.2 ms per loop

In [14]: %timeit bcolz.carray_ext.groupsort_indexer_without_counts_cython(*bcolz.carray_ext.factorize_with_counts_cython(c))
100 loops, best of 3: 19.4 ms per loop

@FrancescElies
Member Author

If N gets bigger (N = int(1e7)), counting while factorizing seems to make a real difference.

In [27]: %timeit bcolz.carray_ext.groupsort_indexer_cython(*bcolz.carray_ext.factorize_cython(c))
1 loops, best of 3: 3.08 s per loop

In [28]: %timeit bcolz.carray_ext.groupsort_indexer_without_counts_cython(*bcolz.carray_ext.factorize_with_counts_cython(c))
1 loops, best of 3: 1.87 s per loop

@CarstVaartjes
Contributor

Hi @esc! You can close this one, as it's been decided to move to a separate package (which will plug into the new .pxd file).

@esc
Member

esc commented Dec 30, 2014

Okidokey.

@esc esc closed this as completed Dec 30, 2014