Conversation
I have given a quick look into your PR (btw, thanks for sending it). I need some time to digest that (I am swamped lately), but some quick comments:
At any rate, I will try to put more time in this PR later, but first I would like to be convinced that we are not reinventing the wheel ;) |
Happy to contribute!
We tried quite a few other approaches too; the cytoolz one was relatively quick, but still much slower than Pandas (which is really, really quick). So given a bit more time, we might be able to learn from Pandas and improve things here. Maybe we can take both of these step-by-step... Edit: it was a disk-based bcolz actually (which makes it rather impressive); an in-mem bcolz was a bit, but not much, quicker |
added dict constructor
I had a look at this code a couple of times over the last few days and I am finding it fairly tricky to digest and review, mainly due to the size of the implementation and a little bit also the structure. When I think of a 'groupby implementation' I usually think of the following article: http://wesmckinney.com/blog/?p=489 It explains quite well how groupby is a combination of -- loosely speaking -- counting sort and factorize. Your groupby implementation seems to be doing something like this, at least it appears to be composed of several steps. Perhaps one way to make this code easier to comprehend, which is in the interest of long-term maintenance, would be to refactor the individual steps into separate functions? This would decrease coupling and increase the potential for reuse of code within bcolz itself.

Also, the groupby only implements 'sum' at the moment. From what I understand, pandas has a GroupBy object on which you then call the desired aggregation. That would be one way to implement the others here too.

Some food for thought: how might we implement counting sort efficiently on in-memory but also out-of-core chunked compressed storage? Obviously, the first step of counting sort, which just goes through linearly and counts how many items there are in each category using a hashtable, is pretty trivial, assuming there are significantly fewer categories than items. However, the second step, which is also O(n) and actually sorts the elements, will be somewhat trickier than with a contiguous memory buffer, because you need random access for the output. Perhaps one needs to preallocate compressed chunks with zeros (which will compress super nicely) and then use a clever system of caching decompressed chunks. So, for example, several 'open' chunks might reside decompressed in memory in a type of cache. If another chunk must be decompressed to be written into, one of the decompressed chunks is 'closed', meaning that it will be compressed and put in 'storage' (in-mem or on disk). Interesting, of course, will be the selection of a block for cache 'eviction'; several approaches (LRU, current fill status, etc.) come to mind.

For the where_terms implementation: it seems to be implementing some sort of syntax parser? Not sure if that is such a good idea, and if it is, perhaps it should exist in its own right, again for purposes of reuse.

Lastly, on a personal note, I think that it will be very interesting to see how an efficient groupby implementation for bcolz will work, so keep on pushing! 😄 |
edit: this is a good read too: http://www.slideshare.net/wesm/a-look-at-pandas-design-and-development (page 32 and on)
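The two-pass counting-sort view of groupby described in the comment above can be sketched in plain Python. This is a minimal in-memory illustration of the idea (names and structure are mine), not bcolz's chunked out-of-core version, where the random-access placement pass is exactly the hard part being discussed:

```python
def counting_sort_groupby_sum(keys, values):
    # Pass 1: count how many items fall in each category (hash table).
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1

    # Assign each category a contiguous slot range in the output.
    offsets, pos = {}, 0
    for k in sorted(counts):
        offsets[k] = pos
        pos += counts[k]

    # Pass 2: place every element into its slot.  This random access
    # into `out` is what becomes tricky on compressed chunked storage.
    out = [None] * len(keys)
    next_slot = dict(offsets)
    for k, v in zip(keys, values):
        out[next_slot[k]] = v
        next_slot[k] += 1

    # After placement each group's values are contiguous, so the
    # aggregation (here, sum) is a simple linear scan.
    sums, i = {}, 0
    for k in sorted(counts):
        sums[k] = sum(out[i:i + counts[k]])
        i += counts[k]
    return sums
```

Both passes are O(n); the hashtable step corresponds to the "trivial" first step mentioned above.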
Any ideas/comments/feedback are really welcome. I'm aware that this might be extremely delirious and that there might be better approaches |
Not wanting to spam everyone, but at least for @FrancescElies & me, I tried to follow Pandas' logic to understand it better (and I think it's good for the discussion to put it in here). As you'll see, Francesc Elies & I are not the greatest Cython/C experts, but I think we can improve on the current push. Pandas is a bit complicated in terms of its construction. (One general thing: the fact that Pandas removes NaN groupby column values is something I do not think should normally happen.) It imports klib:
Then we have the hashtable pandas routines:
The self.table is the hashtable from khash (to which the current type corresponds):
So it's a nice way to get, for each row, an idea of what the unique values are, how often they occur, and which row relates to which value. For an out-of-core implementation the dict would have to be replaced by an array (containing, for each row, the index of that row's original value), which theoretically would start with the size of the length of the input (because each row could have a unique value) and then would have to be resized at the end to the actual size. So all peachy/understandable up to here, but then we get into groupby.py territory and stuff gets really complicated/long (3700 lines with lots of references, series vs df vs panel paths, etc.):
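The hashtable-based factorize described above can be mimicked with a plain dict; pandas does the same thing in Cython with a khash table for speed. A minimal stand-in sketch (the function name and return convention are mine):

```python
def factorize(values):
    """Map each value to a dense integer label in first-seen order.

    Plain-Python stand-in for a khash-backed factorize: returns
    (labels, uniques) such that uniques[labels[i]] == values[i].
    """
    table = {}    # value -> label; this dict plays the role of khash
    labels = []
    uniques = []
    for v in values:
        if v not in table:
            table[v] = len(uniques)
            uniques.append(v)
        labels.append(table[v])
    return labels, uniques
```

The out-of-core variant discussed above would stream `labels` into a (resizable) array instead of keeping everything in memory.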
and we get something like:
Later on, Pandas refactors them again in a smart way (it calculates the theoretical position in a cartesian product and filters out non-existing combinations)
nb: I don't completely understand why the actual sparse matrix needs to be made here? -> you could just zip through the columns together and calculate the group_index for each row...
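The cartesian-product trick can indeed be done by zipping through the factorized key columns: each row's group id is its mixed-radix position in the product space of the per-column uniques, and unobserved combinations simply never show up. A sketch of that calculation (function and argument names are mine, not pandas'):

```python
def group_index(label_columns, sizes):
    """Combine per-column factorized labels into one flat group id.

    `label_columns` is a list of equal-length label lists (one per key
    column); `sizes` gives the number of uniques per column.  Row i
    gets the mixed-radix id ((l0 * s1 + l1) * s2 + l2) * ... , i.e.
    its position in the cartesian product of the columns' uniques.
    """
    n = len(label_columns[0])
    ids = [0] * n
    for labels, size in zip(label_columns, sizes):
        for i in range(n):
            ids[i] = ids[i] * size + labels[i]
    return ids
```

Combinations that never occur in the data just produce gaps in the id space, which is what the filtering step then compresses out.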
Then a bit later on in the code we get to the aggregation where we do this: agg_func(result, counts, values, comp_ids), basically:
The agg_funcs are in algo.py, but I'm not exactly sure how it loops through the comp_ids (though by looping zipped through the comp_ids and counts you can sub-select the values each time using indexes, I guess) |
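My reading of the `agg_func(result, counts, values, comp_ids)` call above, as a plain-Python sketch: no sub-selection is needed, because a single linear pass over `values` can accumulate directly into the slot that `comp_ids` names (pandas does this in Cython over 2-D arrays; the function below is my simplification for 'sum'):

```python
def group_sum(values, comp_ids, ngroups):
    """Sketch of an agg_func for 'sum': comp_ids[i] says which group
    row i belongs to, so accumulate values[i] into result[comp_ids[i]]
    and bump that group's count, all in one O(n) pass."""
    result = [0.0] * ngroups
    counts = [0] * ngroups
    for v, gid in zip(values, comp_ids):
        result[gid] += v
        counts[gid] += 1
    return result, counts
```

The `counts` array is what lets a 'mean' aggregation be derived afterwards as `result[g] / counts[g]`.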
This is all quite a lot to digest in a single sitting. I'm just scratching the surface, but I did make a factorize implementation here: #63 Regarding the removal of the NaN values, I think it depends on what semantics you want. For example, it is common for datasets to be messy and for certain values to be missing. If you then try to compute things with such a sequence you run into issues; for example, the sum of a sequence with a single NaN in Numpy will yield NaN as the answer. |
We had a closer look at Wes McKinney's blog (a link you posted before); pandas uses klib (https://github.com/attractivechaos/klib) for their factorize (hash-table based), and we proceeded to try to do it the way they do. At the moment we are trying to make the latest version of klib available to Python as an independent implementation (pandas' implementation uses an older version of klib, but everything we are doing is based on their work). Still not finished, but in case you would like to have a look at what we are doing at the moment, you can go to https://github.com/CarstVaartjes/khash |
@esc Hi Valentin, the khash implementation from https://github.com/CarstVaartjes/khash is now working (thx to @javiBi4 and @FrancescElies), you can try it like this:
it needs some code clean-up + the API is still a bit too pandas-specific, but it works as a separate module now + we upgraded it to klib 0.28 (which should be faster than klib 0.26 due to a switch to quadratic probing; 0.26 is what is in Pandas currently, so I'll drop them a mail that this might be interesting to them too) |
@esc would it be okay with you if we do this step-by-step? So first we make a numpy-based solution (using that khash factorize function + counted sorting with an aggregation function), and then we can figure out how to make it out-of-core carray-based? |
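The proposed first step (factorize + aggregate, all in-memory) can be sketched with NumPy alone. Here `np.unique` stands in for the khash factorize (note it returns the uniques sorted rather than in first-seen order) and `np.bincount` does the per-group summation; this is my illustration of the plan, not code from the PR:

```python
import numpy as np

def groupby_sum(keys, values):
    """NumPy sketch of the proposed step: factorize the key column,
    then aggregate the values per label in one vectorized pass."""
    # factorize: uniques[labels[i]] == keys[i]
    uniques, labels = np.unique(keys, return_inverse=True)
    # per-group sum via weighted bincount over the dense labels
    sums = np.bincount(labels, weights=values, minlength=len(uniques))
    return uniques, sums
```

Swapping `np.unique` for the khash-based factorize would preserve first-seen ordering and avoid the sort, which is the point of building that module.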
@CarstVaartjes sure, choose whatever path is best for you. We can always clean up the git history before the final merge of the feature. In parallel, I'll continue exploring ideas I have for the factorize and groupsort algorithms on compressed arrays. I think there are some things there that we can optimize. |
I think this is now superseded by #63 btw |
@CarstVaartjes okay, I'll close this issue. |
Dear Bcolz-team,
first of all, thank you for your efforts creating and maintaining this project; my colleague @CarstVaartjes and I are very interested in this amazing data container.
We proceeded to implement some extra functionality we would like to have (see #57 & the other related issue #61)
We tried our best, but since we are neither C nor Python experts, we know the solution we are proposing here can be optimized. We tried to make a ctable_ext.pyx file for this extra functionality but ran into issues (no pxd file for carrays made it really difficult); that is the main reason for having extended the carray_ext.pyx file.
Apologies in advance for any code or things not meeting the standards of the project, we are open to criticism and would love to improve the code.
Basically we added two things:
The groupby clause is largely based upon cytoolz; its performance is okay but still far off from Pandas. So this works functionally, but for future improvements it might be nice to look at Pandas' hashing & grouping to improve the code.
Please let us know any comments you could have about this request.
Thanks & best regards.
Francesc & Carst