
factorize for bcolz #63

Closed
wants to merge 23 commits into from

Conversation

esc
Member

@esc esc commented Oct 7, 2014

No description provided.

@esc
Member Author

esc commented Oct 7, 2014

The first implementation, factorize_pure, is a slow pure-Python implementation. Here is how it compares to pandas:

In [1]: import bcolz

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import random

In [5]: a = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [6]: %timeit pd.factorize(a)
1 loops, best of 3: 952 ms per loop

In [8]: c = bcolz.carray(a)

In [9]: import bcolz.algos

In [11]: %timeit bcolz.algos.factorize_pure(c)
1 loops, best of 3: 51.7 s per loop

Also, here is the profiling output:

%prun -l 20 -s cumulative bcolz.algos.factorize_pure(c)

         40000490 function calls in 62.945 seconds

   Ordered by: cumulative time
   List reduced from 28 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   62.945   62.945 <string>:1(<module>)
        1   10.059   10.059   62.945   62.945 algos.py:5(factorize_pure)
 10000000   15.851    0.000   52.886    0.000 {method 'append' of 'bcolz.carray_ext.carray' objects}
 10000001   25.481    0.000   37.035    0.000 utils.py:102(to_ndarray)
 10000001   10.504    0.000   10.504    0.000 {numpy.core.multiarray.array}
 10000005    1.050    0.000    1.050    0.000 {len}
      152    0.000    0.000    0.000    0.000 toplevel.py:571(clevel)
      152    0.000    0.000    0.000    0.000 toplevel.py:576(shuffle)
      152    0.000    0.000    0.000    0.000 toplevel.py:581(cname)
        1    0.000    0.000    0.000    0.000 collections.py:38(__init__)
        1    0.000    0.000    0.000    0.000 _abcoll.py:545(update)
        1    0.000    0.000    0.000    0.000 utils.py:72(calc_chunksize)
        4    0.000    0.000    0.000    0.000 collections.py:54(__setitem__)
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 toplevel.py:631(__init__)
        1    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
        1    0.000    0.000    0.000    0.000 abc.py:148(__subclasscheck__)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)
        2    0.000    0.000    0.000    0.000 {math.log10}
        1    0.000    0.000    0.000    0.000 utils.py:52(csformula)

It seems that to_ndarray is quite heavily involved here.
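For illustration, the naive approach boils down to a dict lookup plus a per-element append to the output carray, which is exactly what the profile shows (a minimal sketch of the idea with hypothetical names, not the actual code in algos.py):

import numpy as np
import bcolz

def factorize_naive(values):
    # one dict lookup and one carray.append per element; each append
    # goes through to_ndarray, which is what dominates the profile above
    labels = bcolz.carray(np.empty(0, dtype=np.int64))
    lookup = {}  # value -> integer code, in order of first appearance
    for v in values:
        labels.append(lookup.setdefault(v, len(lookup)))
    reverse = dict((code, v) for v, code in lookup.items())
    return labels, reverse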

@esc
Member Author

esc commented Oct 7, 2014

Added an optimization (factorize_pure2), where the array is converted and appended chunk-wise:

In [7]: %timeit bcolz.algos.factorize_pure2(c)
1 loops, best of 3: 4.5 s per loop

And the corresponding profile:

         606 function calls in 4.560 seconds

   Ordered by: cumulative time
   List reduced from 29 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.560    4.560 <string>:1(<module>)
        1    4.511    4.511    4.560    4.560 algos.py:23(factorize_pure2)
       38    0.047    0.001    0.048    0.001 {method 'append' of 'bcolz.carray_ext.carray' objects}
       39    0.000    0.000    0.001    0.000 utils.py:102(to_ndarray)
      152    0.000    0.000    0.000    0.000 toplevel.py:571(clevel)
      152    0.000    0.000    0.000    0.000 toplevel.py:576(shuffle)
        1    0.000    0.000    0.000    0.000 collections.py:38(__init__)
      152    0.000    0.000    0.000    0.000 toplevel.py:581(cname)
       44    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 _abcoll.py:545(update)
        1    0.000    0.000    0.000    0.000 utils.py:72(calc_chunksize)
        1    0.000    0.000    0.000    0.000 {numpy.core.multiarray.empty}
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
        1    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
        4    0.000    0.000    0.000    0.000 collections.py:54(__setitem__)
        1    0.000    0.000    0.000    0.000 toplevel.py:631(__init__)
        1    0.000    0.000    0.000    0.000 utils.py:52(csformula)
        1    0.000    0.000    0.000    0.000 abc.py:148(__subclasscheck__)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)

I think that is a pretty nice improvement over the naive approach, but we can probably still do better.
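For reference, the chunk-wise idea looks roughly like this (a sketch with hypothetical names; the real factorize_pure2 lives in algos.py):

import numpy as np
import bcolz

def factorize_chunked(carray_in):
    # decompress one chunk at a time, label it into a preallocated
    # buffer, and append whole chunks to the output, so to_ndarray
    # runs once per chunk instead of once per element
    labels = bcolz.carray(np.empty(0, dtype=np.int64))
    lookup = {}
    n = carray_in.chunklen
    buf = np.empty(n, dtype=np.int64)
    for i in range(0, len(carray_in), n):
        chunk = carray_in[i:i + n]       # one decompression per chunk
        for j, v in enumerate(chunk):
            buf[j] = lookup.setdefault(v, len(lookup))
        labels.append(buf[:len(chunk)])  # one append per chunk
    return labels, lookup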

@esc
Member Author

esc commented Oct 7, 2014

And with profiling enabled for the Cython code, we get:

         1703 function calls (1690 primitive calls) in 4.356 seconds

   Ordered by: cumulative time
   List reduced from 59 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.356    4.356 <string>:1(<module>)
        1    4.275    4.275    4.355    4.355 algos.py:23(factorize_pure2)
       38    0.000    0.000    0.049    0.001 {method 'append' of 'bcolz.carray_ext.carray' objects}
       38    0.001    0.000    0.049    0.001 carray_ext.pyx:1334(append)
      152    0.000    0.000    0.047    0.000 carray_ext.pyx:305(__cinit__)
      152    0.000    0.000    0.047    0.000 carray_ext.pyx:357(compress_arrdata)
      152    0.046    0.000    0.046    0.000 carray_ext.pyx:406(compress_data)
       39    0.001    0.000    0.031    0.001 carray_ext.pyx:489(__getitem__)
       39    0.030    0.001    0.030    0.001 carray_ext.pyx:462(_getitem)
       39    0.000    0.000    0.000    0.000 utils.py:102(to_ndarray)
        1    0.000    0.000    0.000    0.000 collections.py:38(__init__)
      8/2    0.000    0.000    0.000    0.000 abc.py:148(__subclasscheck__)
        1    0.000    0.000    0.000    0.000 carray_ext.pyx:988(__cinit__)
        1    0.000    0.000    0.000    0.000 carray_ext.pyx:1034(create_carray)
        1    0.000    0.000    0.000    0.000 _abcoll.py:545(update)
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
     12/5    0.000    0.000    0.000    0.000 {issubclass}
        5    0.000    0.000    0.000    0.000 _weakrefset.py:36(__init__)
      152    0.000    0.000    0.000    0.000 carray_ext.pyx:243(check_zeros)

@esc
Member Author

esc commented Oct 7, 2014

I made a Cython version too, but it's only marginally faster than the second pure-Python approach; the Cython can probably be tweaked some more.

@esc
Member Author

esc commented Oct 9, 2014

I tweaked the code some more and am now only a factor of 2 slower than pandas:

In [1]: import bcolz, bcolz.algos, numpy, random

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a)

In [4]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 1.95 s per loop

In [5]: import pandas

In [6]: %timeit pandas.factorize(a)
1 loops, best of 3: 971 ms per loop

@CarstVaartjes
Contributor

Hi, with the Cython code "independentized" from Pandas (how do you say that, haha), the klib upgrade seems to give a 20% speedup for int64 and a 300% speedup for strings:

import numpy as np
import random
import pandas as pd
from khash.hashtable import Int64HashTable, StringHashTable

def new_klib_int(input_array):
    ht = Int64HashTable(len(input_array))
    return ht.get_labels_groupby(input_array)

def new_klib_str(input_array):
    ht = StringHashTable(len(input_array))
    return ht.factorize(input_array)

a = np.random.randint(100, size=10000000)
b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')
b = pd.Series(b).values

%timeit pd.factorize(a)
%timeit new_klib_int(a)
%timeit pd.factorize(b)
%timeit new_klib_str(b)

In [20]: %timeit pd.factorize(a)
10 loops, best of 3: 122 ms per loop

In [21]: %timeit new_klib_int(a)
10 loops, best of 3: 101 ms per loop

In [22]: %timeit pd.factorize(b)
1 loops, best of 3: 496 ms per loop

In [23]: %timeit new_klib_str(b)
10 loops, best of 3: 165 ms per loop

@esc
Member Author

esc commented Oct 10, 2014

@CarstVaartjes that is pretty good. I'll perhaps try to use the klib hashtable in bcolz soon. The difference at the moment is that the 'labels' returned are a NumPy array; for bcolz this should perhaps be a carray? So, for example, you could store the integer labels as another column in the ctable.

As a side note, you are using an int64 for the labels. Depending on the cardinality of the categorical/factor you may get away with a smaller unsigned version, e.g. uint8. You could compute the number of unique values ahead of time, or alternatively downcast the array after you have created the labels, since the size of the hashtable will tell you what size of integer you need. Either one will, however, double the runtime while decreasing the space needed. For a carray, however, due to the shuffle filter in Blosc, that shouldn't matter much, since all the zeros in the large integers will be compressed away.
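A sketch of the downcasting idea (smallest_uint is a hypothetical helper, not part of bcolz or pandas):

import numpy as np

def smallest_uint(n_uniques):
    # pick the smallest unsigned dtype that can represent
    # codes 0 .. n_uniques - 1
    for dt in (np.uint8, np.uint16, np.uint32):
        if n_uniques <= np.iinfo(dt).max + 1:
            return dt
    return np.uint64

# example: downcast int64 labels once the hashtable size is known
labels = np.array([0, 1, 0, 2, 1], dtype=np.int64)
labels = labels.astype(smallest_uint(3))  # -> uint8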

Another interesting read would be the recent addition of the Categorical datatype to Pandas:

https://pandas-docs.github.io/pandas-docs-travis/categorical.html

Effectively, factorize should return an object of the categorical type, and my gut feeling is that if bcolz is to become a useful tool for data analysis, we will need first-class categorical datatypes.

@CarstVaartjes
Contributor

Hi Valentin,

we'll include it in #62 and yes, we want to convert it there to a version that uses a carray! The main issue is that there is no carray_ext pxd file yet (which gives headaches when developing separate Cython files), plus I worry about writing to a carray sequentially and then sorting it (because of the compression performance penalty that random writes trigger).
But in general, we should be okay for a good groupby!

On categorical: we ourselves do something exactly like that in our architecture; we work with separate reference tables that factorize all datetime and string objects into int64 identifiers before writing things to hdf5 (and in the near future, hopefully, bcolz). (We are a small startup that does data analytics for retailers and fmcg manufacturers.) This might be a nice setup in general; I'm not sure the Python object implementation in pandas isn't overdoing it, while a "translated array" (with the integers) and a "translation array" (with the values), saved as separate arrays, might be much better.

@esc esc mentioned this pull request Oct 10, 2014
@esc
Member Author

esc commented Oct 10, 2014

I did some more experimentation using the khash stuff you provided and it turns out to be quite a lot faster than the standard Python dict:

In [1]: import bcolz, bcolz.algos, numpy, random

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')
In [3]: c = bcolz.carray(a)

In [4]: bcolz.carray_ext.factorize_cython(c)[0]
Out[4]: 
carray((9961472,), uint8)
  nbytes: 9.50 MB; cbytes: 5.53 MB; ratio: 1.72
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[0 0 0 ..., 1 2 2]

In [5]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 1.27 s per loop

In [6]: import pandas

In [7]: %timeit pandas.factorize(a)
1 loops, best of 3: 1.01 s per loop

And we are now only about 260 ms slower than pandas on this particular example.

@esc
Member Author

esc commented Oct 10, 2014

Strangely enough, using typed NumPy arrays we become faster than pandas:

In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: c
Out[4]: 
carray((10000000,), |S2)
  nbytes: 19.07 MB; cbytes: 8.73 MB; ratio: 2.18
  cparams := cparams(clevel=5, shuffle=False, cname='blosclz')
['NY' 'IL' 'CA' ..., 'NY' 'NY' 'OH']

In [5]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 687 ms per loop

In [6]: %timeit pandas.factorize(a)
1 loops, best of 3: 1.03 s per loop

In [7]: pandas.__version__
Out[7]: '0.14.0'

Though I'm not sure whether what I am seeing is correct; it is getting quite late and I fear I made some kind of error somewhere. Anyway, I'll be double-checking the results again soon.

@CarstVaartjes
Contributor

It's quicker than Pandas because it's an updated version of klib, see: pandas-dev/pandas#8524

@esc
Member Author

esc commented Oct 10, 2014

@CarstVaartjes ah, I see. So I'll wait for the new klib to make it into pandas and then test again, but good to know that it might be comparable.

@FrancescElies
Member

Trying to catch up with your current code base, I did some tests with klib 0.2.8 and the old 0.2.6; both lead to very similar results. With @esc's factorize, at least for me, the new klib is as fast as the old one in this example.
On my laptop it is not faster than pandas, so bcolz and the way factorize is built may play a role in the numbers we see (see both results below).

klib_version = '0.2.8'
In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: c
Out[4]: 
carray((10000000,), |S2)
  nbytes: 19.07 MB; cbytes: 8.73 MB; ratio: 2.18
  cparams := cparams(clevel=5, shuffle=False, cname='blosclz')
['NY' 'NY' 'CA' ..., 'IL' 'IL' 'CA']

In [5]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 1.18 s per loop

In [6]: %timeit pandas.factorize(a)
1 loops, best of 3: 943 ms per loop

In [7]: pandas.__version__
Out[7]: '0.14.0'
klib_version = '0.2.6'
In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: c
Out[4]: 
carray((10000000,), |S2)
  nbytes: 19.07 MB; cbytes: 8.73 MB; ratio: 2.18
  cparams := cparams(clevel=5, shuffle=False, cname='blosclz')
['NY' 'IL' 'IL' ..., 'CA' 'CA' 'IL']

In [5]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 1.16 s per loop

In [6]: %timeit pandas.factorize(a)
1 loops, best of 3: 939 ms per loop

In [7]: pandas.__version__
Out[7]: '0.14.0'

@FrancescElies
Member

Hi Valentin (@esc)

do you also get this result for the bcolz factorize, or am I missing something?
The bcolz factorize result looks a bit strange; the output below comes from klib 0.2.8, and if klib 0.2.6 is used it also looks different.

Thanks

In [7]: result_bcolz = bcolz.carray_ext.factorize_cython(c)

In [8]: result_pandas = pandas.factorize(a)

In [9]: result_pandas
Out[9]: (array([0, 1, 2, ..., 2, 1, 2]), array(['IL', 'OH', 'NY', 'CA'], dtype=object))

In [10]: result_bcolz
Out[10]: 
(carray((9961472,), uint8)
   nbytes: 9.50 MB; cbytes: 6.88 MB; ratio: 1.38
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 1 0 ..., 6 6 1],
 {0: 'IL',
  1: 'OH',
  2: 'CA',
  3: 'OH',
  4: 'CA',
  5: 'NY',
  6: 'OH',
  7: 'NY',
  8: 'NY',
  9: 'NY',
  10: 'CA',
  11: 'CA',
  12: 'IL',
  13: 'CA',
  14: 'CA',
  15: 'NY',
  16: 'CA',
  17: 'NY',
  18: 'OH',
  19: 'NY',
  20: 'IL',
  21: 'IL',
  22: 'NY',
  23: 'OH',
  24: 'IL',
  25: 'OH',
  26: 'IL',
  27: 'IL',
  28: 'OH',
  29: 'IL',
  30: 'IL'})

@esc
Member Author

esc commented Oct 13, 2014

@FrancescElies No, that seems to be foobared. I'm double-checking it now. BTW: how did you swap the klib in and out?

@esc
Member Author

esc commented Oct 13, 2014

I actually get a slightly different result:

In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: result = bcolz.carray_ext.factorize_cython(c)

In [5]: result
Out[5]: 
(carray((9961472,), uint8)
   nbytes: 9.50 MB; cbytes: 5.53 MB; ratio: 1.72
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 1 0 ..., 0 0 1], {0: 'NY', 1: 'OH', 2: 'CA'})

And after inserting some print statements, it seems that NY and IL actually hash to the same value.
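A cheap sanity check for this kind of bug, continuing the session above, is to compare the number of codes against the number of distinct values in the raw data:

labels, reverse = result
# four distinct states went in, so getting fewer than four codes out
# means values collided in the hash table
assert len(reverse) == len(numpy.unique(c[:])), "hash collision detected"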

@esc
Member Author

esc commented Oct 13, 2014

Switching back to the Python dictionary gives correct results once again:

In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(500000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: result = bcolz.carray_ext.factorize_cython(c)

In [5]: result
Out[5]: 
(carray((393216,), uint8)
   nbytes: 384.00 KB; cbytes: 410.34 KB; ratio: 0.94
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 1 0 ..., 3 1 1], {'CA': 1, 'IL': 3, 'NY': 0, 'OH': 2})

In [6]: c[10]
Out[6]: 'OH'

In [7]: c[:10]
Out[7]: 
array(['NY', 'CA', 'NY', 'CA', 'CA', 'OH', 'OH', 'CA', 'NY', 'NY'], 
      dtype='|S2')

In [8]: result[0][:10]
Out[8]: array([0, 1, 0, 1, 1, 2, 2, 1, 0, 0], dtype=uint8)

@CarstVaartjes
Contributor

I can see why you would want to have a split between the core and the analytics part (maintenance, bloat, etc.); some things which are different or would have to be looked at, from my humble PoV:

  • the ctable is something that lives in Pandas rather than numpy (not that I'm advocating that you should do that btw! I just mean it's a difference, though it's of course really good that bcolz already has it)
  • Make a carray_ext.pxd file #68 would be necessary (a good thing in general of course, but here a necessity); we looked at it, but it will be a mighty complicated pxd with all the available functions
  • some things are a bit of a grey area, e.g.
    • factorizing is necessary for analytics, but maybe nice to have as core functionality for other apps
    • caching a factorization result (the index array and the unique value array/dict) can be great; if it's in bcolz you might have it as a direct function on a carray / ctable columns, but it can also work from the analytics part, though in that case you cannot enforce re-indexing after creates/updates/deletes if a second process writes to them
  • if you want to add layers on top such as multi-server distribution (for load balancing and/or failover), data access security models, etc., having bcolz and analytics as separate modules would be good (because that way you can plug a model in between)
  • with separate modules we might also be able to open source some of our internal stuff (S3 syncing, metadata models), which definitely belongs in a higher-level module (you would get bcolz -> load balancing -> analytics -> meta layer with security checks, data definitions & web calls)

Just some thoughts. Probably the earlier this happens the better, right? So the pxd is really high prio then?

@FrancescElies
Member

@esc you can find the tests in
https://github.com/FrancescElies/bcolz/blob/factorize/bcolz/tests/test_carray.py
In factorizeStringsTest, test01 is pandas-dependent, which I am not sure is a good idea.

You can also find factorizeIntsTest, but this is obviously not working yet, since it would test factorize for ints.

I also started having a look at groupsort from pandas
https://github.com/FrancescElies/bcolz/blob/groupsort/bcolz/algos.py#L44-44

At the moment I am having a look at the info you sent. I am open to any comments or things you would like me to change, in case you are interested. Thanks!

@FrancescElies FrancescElies mentioned this pull request Oct 16, 2014
@FrancescElies
Member

@esc, inside factorize_cython, would it be a good idea to type the reverse parameter as a carray? Though having all values unique should normally not be the case, this dict could potentially get quite big in some particular cases.

@esc
Member Author

esc commented Oct 16, 2014

@FrancescElies I had been thinking about that. In principle the lookup is indexed and ordered, so it is possible to replace the dict with a carray. Also, since carray appends are (or should be) lightning fast for the most part, you could simply append to the carray every time you encounter a new unique value w/o much overhead. Only benchmarking will tell how that performs, though. In my first implementation of factorize I noticed that there may be some speed issues with appending scalars to carrays. That will most probably have to be looked at.
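A sketch of what that could look like (hypothetical and unbenchmarked; codes are assigned in encounter order, so code i is simply position i in the reverse carray):

import numpy as np
import bcolz

def factorize_with_carray_reverse(values, dtype='S2'):
    lookup = {}  # value -> code
    # reverse lookup as a carray: indexed and ordered by code
    reverse = bcolz.carray(np.empty(0, dtype=dtype))
    labels = np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        if v not in lookup:
            lookup[v] = len(lookup)
            reverse.append(v)  # scalar append; see the speed caveat above
        labels[i] = lookup[v]
    return bcolz.carray(labels), reverse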

@CarstVaartjes
Contributor

edit: oops, misread the whole thing. The helper was about reading a chunk at a time, not writing it (so the stuff below doesn't make sense).
edit2: no, it actually might make sense ;) as it reads a chunk at a time but then buffers the writing

btw, about the issue with appends in chunks in _factorize_helper_with_counts, isn't this a possible solution?

class carray_buf(object):

    def __init__(self, carray_input):
        self.carray = carray_input
        self.length = carray_input.chunklen  # flush threshold: one chunk
        self.index = 0
        self.value_list = []

    def append(self, value):
        # buffer values in a plain list and write a whole chunk at once
        self.index += 1
        self.value_list.append(value)

        if self.index == self.length:
            self.carray.append(self.value_list)
            self.value_list = []
            self.index = 0

    def close(self):
        # first check whether there are actual values left to save
        if self.value_list:
            self.carray.append(self.value_list)
            self.carray.flush()
            self.value_list = []
            self.index = 0

(This is based on some similar issues we had inside our algorithms with saving up hdf5 writes in our current solution.) Then you can just directly append to the item each time and it will take care of efficient saving itself.
Some things that would need to be improved:

  1. cythonize it with cdefs
  2. during the init, self.index should be set to the current leftovers, so it first fills up the current chunk (see the sketch below)
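For point 2, the __init__ could be seeded from the carray's current fill, e.g. (a sketch; it assumes len(carray) % chunklen gives the size of the partial last chunk):

    def __init__(self, carray_input):
        self.carray = carray_input
        self.length = carray_input.chunklen
        # start from the current partial-chunk fill so the first flush
        # completes the existing chunk instead of starting a new one
        self.index = len(carray_input) % carray_input.chunklen
        self.value_list = []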

@CarstVaartjes
Contributor

Ok so I tried that and while it worked nicely, it made it twice as slow ;) oops!

@FrancescAlted
Member

Just to give my opinion on what @esc is wondering: yes, my intention was to keep bcolz really minimal, and my hope was to find in PyToolz/CyToolz a nice counterpart for taking care of the analytics part, but my hopes really vanished as soon as you showed me that you can easily get a 10x increase in performance by implementing groupby internally in bcolz.

Frankly, I don't know where bcolz should go. My initial intention was indeed to keep it simple (I knew that I was not going to have too much time to put into it), but on the other hand performance is important as well, so perhaps integrating more functionality into it is good. The problem is: at which point should we decide to stop adding high-performance functionality? And, if we do not put a limit, are we perhaps bound to replicate pandas, just using carray/ctable bcolz objects? If so, would it not be better to make pandas work on top of bcolz?

Hmm, we are facing an important decision here, and I don't have all the answers. For the time being though I would like to move this discussion to the mailing list, and perhaps involve people from pandas and other related projects to participate and express their thoughts there.

At any rate, I really like that this discussion is happening and encourage you to continue providing suggestions on how bcolz can be improved.

@CarstVaartjes
Contributor

Hi Francesc (A. :)

I have some ideas, I will put them in an e-mail to you and Valentin later this week! It's probably much better if you involve Jeff Reback etc., as they only know me from posting obscure Pandas & PyTables bugs.
To be honest, I think it will be incredibly hard to disentangle Pandas from NumPy (lots of specific optimizations have been done for it); also see what Continuum Analytics is doing now. To make bcolz perform you need to do some tricks too (see some of the really great stuff @esc did in the factorization) and you have to keep the chunks in mind for optimum performance (some algorithms such as counting sort depend on non-sequential writes, which also gives some issues). But with some tweaks out-of-core bcolz is extremely close to pandas, and that alone makes it incredible (and gives it the potential to blast away Hadoop etc.)
Anyway, I will write a mail later this week!

On the groupby, we're getting really, really close. I've extended/adjusted @FrancescElies's code to create a groupby function that can work over multiple columns and also cache factorizations. See: https://github.com/FrancescElies/bcolz/commits/groupsort

import bcolz
from bcolz import carray_ext
import tempfile
import os
import pandas as pd
import numpy as np

projects = [
{'a1': 'build roads',  'a2': 'CA', 'a3': 2, 'm1':   1, 'm2':  2, 'm3':    3.2},
{'a1': 'fight crime',  'a2': 'IL', 'a3': 1, 'm1':   2, 'm2':  3, 'm3':    4.1},
{'a1': 'help farmers', 'a2': 'IL', 'a3': 1, 'm1':   4, 'm2':  5, 'm3':    6.2},
{'a1': 'help farmers', 'a2': 'CA', 'a3': 3, 'm1':   8, 'm2':  9, 'm3':   10.9},
{'a1': 'build roads',  'a2': 'CA', 'a3': 3, 'm1':  16, 'm2': 17, 'm3':   18.2},
{'a1': 'fight crime',  'a2': 'IL', 'a3': 1, 'm1':  32, 'm2': 33, 'm3':   34.3},
{'a1': 'help farmers', 'a2': 'IL', 'a3': 2, 'm1':  64, 'm2': 65, 'm3':   66.6},
{'a1': 'help farmers', 'a2': 'AR', 'a3': 3, 'm1': 128, 'm2': 129, 'm3': 130.9}
]

df = pd.DataFrame(projects * 10000)
print df

prefix = 'bcolz-'
rootdir = tempfile.mkdtemp(prefix=prefix)
os.rmdir(rootdir)  # tests needs this cleared
print(rootdir)
fact_bcolz = bcolz.ctable.fromdataframe(df, rootdir=rootdir)
fact_bcolz.rootdir
self = fact_bcolz

groupby_cols = ['a1', 'a2', 'a3']
fact_bcolz.cache_factor(groupby_cols, refresh=True)  # this caches the factorizations on-disk directly in the rootdir
fact_bcolz.groupby(groupby_cols, {})  # does the first 3 parts of the groupby, see the code

There's room for improvement left, of course: ideally there would be a metadata layer for each column where I can write two carrays (for the factorized index and for the unique values). Also, numexpr cannot handle unsigned int64, unfortunately. But hey, it works! :) It's still slow compared to Pandas, however, so more optimizations are required, but I would say: let's make something that works from start to end first, then look at bottlenecks and optimize.

@esc
Member Author

esc commented Oct 22, 2014

I'll answer inline to the above using my mailer. Also, I would tend to agree with @FrancescAlted about moving any more lengthy discussions to the mailing lists. That way we can use the threading inherent to email in case we need to spin off tangents; the GitHub interface really sucks at this kind of stuff and isn't really suited for more than a few simple replies back and forth.

@esc
Member Author

esc commented Oct 22, 2014

Hi,

  • Carst Vaartjes notifications@github.com [2014-10-16]:

    I can see why you would want to have a split between the core and the
    analytics part (maintenance, bloat, etc.); some things which are
    different or have to look at from my humble PoV:

  • the ctable is something that is in Pandas rather than numpy (not
    that I'm advocating that you should do that btw! I just mean it's a
    difference, though it's of course really good that bcolz already has
    it)

Yes, perhaps the ctable needs to move then too? Or everything gets
dumped into bcolz. That would probably work too.

  • Make a carray_ext.pxd file #68 would be necessary (a good thing in general of course, but also
    necessary); we looked at it, but it will be a mighty complicated pxd
    with all the available functions

Hehe, yes.

  • some things are a bit grey areas, e.g.
    • factorizing is necessary for analytics, but maybe nice to have as
      core functionality for other apps

True. Requires the klib stuff though if it wants to be fast.

  • caching of a factorization result (the index array and the unique
    value array/dict) can be great; if it's in bcolz you might have it as
    a direct function on a carray / table columns, but it can also work
    from the analytics part though in that case you cannot enforce
    re-indexing after create/update/deletes if a second process writes
    from them

Yeah, that sucks. Maybe we need to hash the contents somehow to a file
on disk and invalidate the factorization if the hash changes.
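A minimal sketch of such an invalidation check (hypothetical; it hashes the data chunk-wise and would be compared against a digest stored alongside the cached factorization):

import hashlib

def content_digest(carray_in):
    # hash the data chunk-wise; recompute and compare this against the
    # digest stored when the factorization was cached
    h = hashlib.sha1()
    n = carray_in.chunklen
    for i in range(0, len(carray_in), n):
        h.update(carray_in[i:i + n].tobytes())
    return h.hexdigest()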

  • if you want to go to add layers on top such as multi-server
    distribution (for load balancing and/or failover), data access
    security models, etc. having them bcolz and analytics as separate
    modules would be good (because that way you can plug in a model in
    between)

Absolutely. Modular design does have its benefits. But drawbacks too;
just look at GNU Hurd.

  • with separate modules we might be able to also open source some of
    our internal stuff (s3 synching, metadata models) which definitely are
    better for a higher level module (you would get bcolz -> load
    balancing -> analytics -> meta layer with security check, data
    definitions & web calls)

That would be nice. I have a bloscpack S3 backend in the mental
pipeline, just need to lay down the code.

Just some thoughts. Probably the earlier this would happen the better
right? So the pxd is then really high prio?

Well, the pxd file would be awesome for sure; the Cython code seems a
bit mangled already. It might be tricky, but it will be essential for reuse.

@esc
Member Author

esc commented Oct 22, 2014

How do you want me to integrate this? Can I cherry-pick the commit and
push it as part of my pull-request? Do you see the factorization stuff
as a single pull-request or should that be a commit in the groupby
branch?

best,

V-

@esc
Member Author

esc commented Oct 22, 2014

  • Carst Vaartjes notifications@github.com [2014-10-18]:

    btw about the issue with appends in chunks in the _factorize_helper_with_counts, isn't this a possible solution?

Yeah, I am not sure what is going on there. If the thing to append is a
scalar of the correct type, it should just be unboxed and placed into the
leftovers array. After placing it there, check if the leftovers array is
full; if so, compress it and open a new, empty one. Maybe that is what is
being done already, but I think it was super slow last time I checked.


@esc
Member Author

esc commented Oct 22, 2014

  • Carst Vaartjes notifications@github.com [2014-10-19]:

    Hi Francesc (A. :)

    I have some ideas, I will put them on e-mail to you and Valentin later
    this week! It's probably much better if you involve Jeff Reback etc.
    as they only know me from posting obscure Pandas & Pytables bugs.

To be honest, I think it will be incredibly hard to disentangle Pandas
from Numpy (lots of specific optimizations have been done for it) +
see what continuum analytics is doing now. Also to make bcolz perform
you need to do some tricks too (see some of the really great stuff
@esc did in the factorization) and have to keep in mind the chunks for
optimum performance (some algorithms such as counted sort depend on
non-sequential writes, which also gives some issues). But with some
tweaks out-of-core bcolz is extremely close to pandas, and that alone
makes it incredible (and gives it a potential to blast away hadoop
etc.)

Yes, we can't put pandas on top of bcolz. Things aren't modular enough
to allow that.

Anyway, will make a mail later this week!

On the groupby, we're getting really really close. I've extended/adjusted @FrancescElies his code to create a groupby function that can work over multiple columns and also cache factorizations. See: https://github.com/FrancescElies/bcolz/commits/groupsort


This looks really promising! I would like to profile and optimize this!

@CarstVaartjes
Contributor

Hi esc, let us wrap up the commit first into a nice, quickly performing whole (Friday, I really hope), and then you can decide how to pick (and what should and should not be in bcolz).
The good news: we have it fully functioning now! The bad news: there are still some in-core elements and some issues around performance (I hope to solve the biggest bottlenecks by late Friday by applying your factorization writing method to other parts too).
If only there was an "auto-create pxd" macro :)

else:
    # allocate enough memory to hold the string, add one for the
    # null byte that marks the end of the string.
    insert = <char *>malloc(allocation_size)
Member Author

Not sure if we need to check for a NULL pointer here. I never did grok Cython's malloc wrapper. (But IIRC yes, we need to check it.)
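For reference, a minimal sketch of the check (assuming the usual libc.stdlib malloc, which is the raw C malloc and returns NULL on failure):

from libc.stdlib cimport malloc

insert = <char *>malloc(allocation_size)
if insert == NULL:
    # malloc does not raise on failure; do it explicitly
    raise MemoryError("could not allocate string buffer")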

@FrancescElies
Member

@esc & @CarstVaartjes about integrating the tests, I really have no preference. In my opinion you can do it whichever way fits best for you: cherry-picking, copy-paste.
I am just glad if you want to take any part :)

@esc
Member Author

esc commented Oct 23, 2014

Okay, I see that my own pull-request is still a bit messy and I'm not sure it will be the pull-request that ends up in the code base proper. However, I'd like to start thinking about how to structure the individual pull-requests for the groupby so that they are as easy to review as possible, i.e. bite-sized chunks (no pun intended). For example, from this pull-request, the klib integration (whatever form that might take) is one thing. The factorize code (in all its variants, and for all types) is another.

@FrancescElies
Member

As somebody told me today, I should not be doing this at this moment :), anyway...

After trying out quite a lot of things, working hard with @CarstVaartjes, using @esc's code to produce similar things, and having discussions around nice ideas from @esc & @CarstVaartjes for a groupby function, in the end the problems were more or less successfully tackled.

We are aware that this is not finished at all; it can still be optimised, better structured, and stripped of code replication, and of course it needs further testing.

Nevertheless, we would like to generate debate around it and see if in the end we can get something nice ;)

A short summary of the actual groupby implementation (a minimal sketch follows the list):

  1. Factorize the groupby columns and save the result to disk (this caches the factorization for further use, saving time if groupby is called again)
  2. Compute a group index for each observation
  3. Create the aggregation result
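For illustration, a minimal NumPy sketch of steps 2 and 3 for a single measure column (hypothetical names; the real implementation works chunk-wise on carrays and assumes step 1 already produced the per-column label arrays):

import numpy as np

def groupby_sum(label_cols, measure):
    # 2. combine per-column labels into one group index per row,
    #    like a mixed-radix number over the columns' cardinalities
    index = np.zeros(len(measure), dtype=np.int64)
    for labels in label_cols:
        index = index * (labels.max() + 1) + labels
    # re-factorize the combined index into dense group ids
    _, groups = np.unique(index, return_inverse=True)
    # 3. sum each measure value into its group's slot
    out = np.zeros(groups.max() + 1, dtype=measure.dtype)
    np.add.at(out, groups, measure)
    return out

a1 = np.array([0, 1, 1, 0])          # labels for one groupby column
a2 = np.array([0, 0, 1, 0])          # labels for another
m = np.array([1.0, 2.0, 3.0, 4.0])   # a measure column
print(groupby_sum([a1, a2], m))      # -> [ 5.  2.  3.]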

The following file https://github.com/FrancescElies/bcolz/blob/groupby/test_speed/groupby_vs_pandas.py shows how the groupby function is being called at the moment; it runs a small benchmark and compares the execution time against pandas (code based on @FrancescAlted's sum script, #78 (comment)). The correctness of the values needs visual inspection, though.
For testing purposes each of the groupby columns has a different type (string, float64, int32 & int64). The comparison to pandas does not take the caching part into account (caching should only happen once as long as the data suffers no modification).

Benchmark groupby: suppose we want to group the following dataframe and aggregate the sum of the different groups (groupby columns a1, a2, a3 & a4):

N = int(1e6)
projects = [
{'a1': 'build roads',  'a2': 1.1, 'a3': 2, 'a4': 2000000000, 'm1':   1, 'm2':  2, 'm3':    3.2},
{'a1': 'fight crime',  'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1':   2, 'm2':  3, 'm3':    4.1},
{'a1': 'help farmers', 'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1':   4, 'm2':  5, 'm3':    6.2},
{'a1': 'help farmers', 'a2': 1.1, 'a3': 3, 'a4': 3000000000, 'm1':   8, 'm2':  9, 'm3':   10.9},
{'a1': 'build roads',  'a2': 1.1, 'a3': 3, 'a4': 3000000000, 'm1':  16, 'm2': 17, 'm3':   18.2},
{'a1': 'fight crime',  'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1':  32, 'm2': 33, 'm3':   34.3},
{'a1': 'help farmers', 'a2': 1.2, 'a3': 2, 'a4': 2000000000, 'm1':  64, 'm2': 65, 'm3':   66.6},
{'a1': 'help farmers', 'a2': 1.3, 'a3': 3, 'a4': 3000000000, 'm1': 128, 'm2': 129, 'm3': 130.9}
]
df = pd.DataFrame(projects * N)
> python test_speed/groupby_vs_pandas.py

reference
                                       m1         m2            m3
a1           a2  a3 a4
build roads  1.1 2  2000000000    1000000    2000000  3.200000e+06
                 3  3000000000   16000000   17000000  1.820000e+07
fight crime  1.2 1  1000000000   34000000   36000000  3.840000e+07
help farmers 1.1 3  3000000000    8000000    9000000  1.090000e+07
             1.2 1  1000000000    4000000    5000000  6.200000e+06
                 2  2000000000   64000000   65000000  6.660000e+07
             1.3 3  3000000000  128000000  129000000  1.309000e+08
-->  Pandas groupby 1.396 sec

-->  Bcolz caching 1.801 sec

[('build roads', 1.1, 2, 2000000000, 1000000, 2000000, 3200000.0000426522)
 ('fight crime', 1.2, 1, 1000000000, 34000000, 36000000, 38400000.00080768)
 ('help farmers', 1.2, 1, 1000000000, 4000000, 5000000, 6200000.0001121415)
 ('help farmers', 1.1, 3, 3000000000, 8000000, 9000000, 10900000.000203883)
 ('build roads', 1.1, 3, 3000000000, 16000000, 17000000, 18199999.99965895)
 ('help farmers', 1.2, 2, 2000000000, 64000000, 65000000, 66600000.0010485)
 ('help farmers', 1.3, 3, 3000000000, 128000000, 129000000, 130900000.00236544)]
-->  Bcolz groupby 2.333 sec

1.671x slower than pandas (cache time not taken into account)

The profiler output looks like this (sum and looping seem to be consuming most of the time):

In [7]: %prun -l 10 -s cumulative fact_bcolz.groupby(groupby_cols, agg_list)
         24218115 function calls in 6.557 seconds

   Ordered by: cumulative time
   List reduced from 152 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    6.557    6.557 ctable.py:1289(groupby)
        1    0.000    0.000    6.323    6.323 {bcolz.carray_ext.aggregate_groups}
        1    0.008    0.008    6.323    6.323 carray_ext.pyx:3025(aggregate_groups)
       21    2.992    0.142    5.959    0.284 carray_ext.pyx:3014(agg_sum)
 24000049    2.083    0.000    3.024    0.000 carray_ext.pyx:2449(__next__)
     1369    0.204    0.000    0.970    0.001 carray_ext.pyx:1862(__getitem__)
     4376    0.713    0.000    0.713    0.000 carray_ext.pyx:465(_getitem)
     2990    0.026    0.000    0.541    0.000 carray_ext.pyx:504(__getitem__)
        8    0.005    0.001    0.444    0.055 chunked_eval.py:81(eval)
        8    0.010    0.001    0.438    0.055 chunked_eval.py:157(_eval_blocks)

The code can be found at https://github.com/FrancescElies/bcolz/tree/groupby

Again a long post, it can't be helped, sorry :)

@CarstVaartjes: if I am missing something, please feel free to add.

@esc & @FrancescAlted thanks a lot for keeping up with and responding to our never-ending questions

@CarstVaartjes
Contributor

Maybe it's good to summarize it:

  • We have a fully functioning groupby now (https://github.com/FrancescElies/bcolz/tree/groupby)
  • Most parts are out-of-core; only the hashing in the factorization isn't, but that should not give issues except in extreme cases
  • We have built a method to cache the factorization for disk-based ctables
  • Without caching, in our specific example the relative performance of out-of-core bcolz vs in-mem pandas is 2.5-3.0x
  • With caching, out-of-core bcolz is only 1.4-1.8x slower than in-mem pandas <- this is extremely impressive!
  • I will build in a bool-array sub-selection method this weekend (so you can combine a where clause and a groupby directly); after that I would say groupby version 0.1 is finished and ready for production testing :)
  • There are probably quite a lot of improvements that can still be made (extensions, performance improvements), but we need some guidance from you guys there. I would make that version 0.2
  • In terms of structure (what should be where etc.) the setup is a bit hampered by the current state of bcolz (no pxd, etc.), so we cannot make it a separate module on top of bcolz yet, but there is no theoretical reason why that would not be feasible.

Please let us know how you guys want to proceed from here (integrate it, have it as a separate function, the relationship to pandas, blaze, etc. etc.) -> shall I write a mail to the list for a discussion?
We ourselves are testing this version atm and will switch part of our own system from in-mem pandas services to out-of-core bcolz (which will make it a lot more scalable etc.)

@FrancescAlted
Member

Thanks Valentin for the summary, because I was a bit lost with so many PRs
and tickets (for example, I doubt that #78 was really necessary, unless we
are entering an era where mailing lists are really a thing of the past). I
continue to think that we should use the mailing list more for having this
discussion (call me old-fashioned if you want :). Valentin, could you
please start a thread there so that we can start discussing the different
problems that you want to tackle?

Thanks again!


@CarstVaartjes
Contributor

Actually, both #78 and #76 can be closed now; this pull as well, I think -> once you have decided on how you want to integrate the functionality, we will prepare that and make an entirely new pull.
There was also quite a bit of e-mail traffic on the side, next to all of this, and I understand that this is not really an "issue" but more a discussion and should be put on the mailing list :)
I'll write a mail and let's see how you want to take it from there.

@FrancescElies
Member

My apologies if I opened too many tickets; the open source field is new to me.

@esc
Member Author

esc commented Dec 21, 2014

What should we do with this issue? Perhaps in the future we can use the mailing list for any architectural and planning discussions and reserve issues for tracking bugs and fixes.

@CarstVaartjes
Contributor

Good point, this got a bit more out of hand than we thought! Let's do the rest through the mailing list. In short: it works (we've gone into production with the bcolz-based groupby, and it works really well!) and it is built of the following 3 blocks:

  1. carray_ext pxd extension so we can have a more modular approach (existing pull request)
  2. basic carray functionality: factorization
  3. ctable groupby functionality

Let's see how we can do that there

@CarstVaartjes
Contributor

This one can be closed btw, as it's covered in bquery now

@esc
Member Author

esc commented Feb 28, 2015

Sure.

@esc esc closed this Feb 28, 2015