The first implementation
Also, here is the profiling output:
It seems like the
Added an optimization (
And the corresponding profile:
I think that is a pretty nice achievement over the naive approach, but we can probably still do better.
And with profiling for cython active we get:
I made a cython version too, but it's only marginally faster than the second pure-python approach; one can probably tweak the cython some more.
I tweaked the code some more and am now at a factor of 2 slower than pandas:
Hi, with the cython code "independentized" from Pandas (how do you say that, haha), the klib upgrade seems to give a 20% increase for int64 and a 300% increase for strings:
@CarstVaartjes that is pretty good. I'll perhaps try to use the klib hashtable in bcolz soon.

The difference at the moment is that the 'labels' returned are a Numpy array; for bcolz this should perhaps be a carray? So, for example, you could store the integer labels as another column in the ctable.

As a side note, you are using an int64 for the labels. Depending on the cardinality of the categorical/factor you may get away with a smaller unsigned version, e.g. uint8. You could compute the number of unique values ahead of time, or alternatively downcast the array after you have created the labels, since the size of the hashtable will tell you what size of integer you need. Either one will however double the runtime while decreasing the space needed. For a carray, though, due to the shuffle filter in blosc, that shouldn't matter much, since all the zeros in the large integers will be compressed away.

Another interesting read would be the recent addition of the Categorical datatype to Pandas: https://pandas-docs.github.io/pandas-docs-travis/categorical.html Effectively
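The downcasting idea mentioned above could be sketched roughly like this (a hypothetical helper, not code from this PR): once the hashtable tells you the number of unique values, pick the smallest unsigned dtype that can hold all label values and cast the label array down.

```python
import numpy as np

def downcast_labels(labels, n_unique):
    # hypothetical sketch: choose the smallest unsigned integer dtype
    # that can represent label values 0 .. n_unique-1
    for dt in (np.uint8, np.uint16, np.uint32):
        if n_unique <= np.iinfo(dt).max + 1:
            return labels.astype(dt)
    return labels.astype(np.uint64)

# e.g. three unique values fit comfortably in a uint8
labels = np.array([0, 1, 2, 1, 0], dtype=np.int64)
small = downcast_labels(labels, 3)  # dtype becomes uint8
```

As noted in the comment, the extra pass costs runtime; whether the space saving matters depends on whether the labels end up in a compressed carray anyway.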
Hi Valentin, we'll include it in #62 and yes, we want to convert it there to a version that uses a carray! The main issue is that there is no carray_ext pxd file yet (which gives headaches when developing separate cython files), plus I worry about writing to a carray sequentially and then sorting it (because of the compression performance penalty it triggers while randomly writing).

On categorical: we ourselves do something exactly like that in our architecture; we work with separate reference tables that factorize all datetime and string objects into int64 identifiers before writing things to hdf5 (and, in the near future hopefully, bcolz). We are a small startup that does data analytics for retailers and fmcg manufacturers. This might be a nice setup in general; I'm not sure the python object implementation in pandas isn't overdoing it, while a "translated array" (with the integers) and a "translation array" (with the values), saved as separate arrays, might be much better.
I did some more experimentation using the khash stuff you provided, and it turns out to be quite a lot faster than the standard python dict:
And we are now only about 30ms slower than pandas on this particular example.
Strangely enough, using typed numpy arrays we become faster than pandas:
Though I'm not sure if what I am seeing is correct; it is getting quite late and I fear I made some kind of error somewhere. Anyway, I'll be double-checking the results again soon.
It's quicker than Pandas because it's an updated version of klib, see: pandas-dev/pandas#8524
@CarstVaartjes ah I see, so I'll wait for the new klib to make it into pandas and then test again, but good to know that it might be comparable.
Trying to catch up with your current code base, I did some tests with klib 0.2.8 and the old 0.2.6; both lead to very similar results. With factorize from @esc, at least for me, the new klib is as fast as the old one on this example.
Hi Valentin (@esc), do you also get this result for bcolz factorize, or am I missing something? Thanks
@FrancescElies No, that seems to be foobared. I'm double-checking it now. BTW: how did you swap the klib in/out?
I actually get a slightly different result:
And after inserting some print statements, it seems like NY and IL actually hash to the same value. |
Switching back to the python dictionary gives correct results once again:
I can see why you would want to have a split between the core and the analytics part (maintenance, bloat, etc.); some things which are different or have to be looked at, from my humble PoV:
Just some thoughts. Probably the earlier this happens the better, right? So the pxd is then really high prio?
@esc you can find the tests in You can also find I also started having a look at groupsort from pandas. At the moment I am having a look at the info you sent; I am open to any comments or things you would like me to change in case you are interested, thanks!
@esc, inside factorize_cython, would it be a good idea to type the reverse parameter as a carray? Though it should normally not be the case that all values are unique, this dict could potentially be quite big in some particular cases.
@FrancescElies I had been thinking about that. In principle the lookup is indexed and ordered, so it is possible to replace the dict with a carray. Also, since carray appends are (or should be) lightning fast for the most part, you could simply append to the carray every time you encounter a new unique value without much overhead. However, only benchmarking will tell how this performs. In my initial implementation of factorize I noticed that there may be some speed issues with appending scalars to carrays. That will most probably have to be looked at.
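For reference, the factorize idea being discussed might be sketched in pure Python like this (a dict standing in for the khash table, a plain list standing in for the dict/carray reverse lookup; the names are made up for illustration):

```python
import numpy as np

def factorize(values):
    # hypothetical sketch of factorize: map each value to an integer
    # label, keeping a reverse lookup of label -> first-seen value
    table = {}    # value -> label (a khash table in the cython code)
    reverse = []  # label -> value; could be a carray appended to
                  # whenever a new unique value appears
    labels = np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        lab = table.get(v)
        if lab is None:
            lab = len(reverse)
            table[v] = lab
            reverse.append(v)
        labels[i] = lab
    return labels, reverse

labels, reverse = factorize(['NY', 'IL', 'NY', 'CA'])
# labels: [0, 1, 0, 2], reverse: ['NY', 'IL', 'CA']
```

Because the labels are assigned in first-seen order, the reverse lookup only ever grows at the end, which is what makes an append-only carray a plausible replacement for the dict.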
Edit: oops, misread the whole thing; the helper was about reading a chunk at a time, not writing it (so the stuff below doesn't make sense).

BTW, about the issue with appends in chunks in the _factorize_helper_with_counts: isn't this a possible solution? (This is based on some similar issues we had inside our algorithms with saving up for hdf5 writes in our current solution.) Then you can just directly append to the item each time and it will take care of efficient saving itself.
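The buffering idea described here could look roughly like this in plain Python (a hypothetical sketch, not the actual bcolz code): scalar appends accumulate in a fixed-size numpy buffer that is flushed to the backing store one chunk at a time.

```python
import numpy as np

class BufferedAppender:
    # hypothetical sketch: collect scalar appends in a small numpy
    # buffer and hand it to the backing store (e.g. a carray) one
    # chunk at a time, instead of appending scalars directly
    def __init__(self, flush_chunk, chunklen=4096, dtype='i8'):
        self._flush_chunk = flush_chunk  # callable taking a chunk array
        self._buf = np.empty(chunklen, dtype=dtype)
        self._n = 0

    def append(self, value):
        self._buf[self._n] = value
        self._n += 1
        if self._n == len(self._buf):
            self.flush()

    def flush(self):
        if self._n:
            self._flush_chunk(self._buf[:self._n].copy())
            self._n = 0

# usage: flushed chunks are collected in a list here for illustration
chunks = []
app = BufferedAppender(chunks.append, chunklen=3)
for x in range(7):
    app.append(x)
app.flush()  # chunks now holds [0,1,2], [3,4,5], [6]
```

Whether this actually beats direct scalar appends depends on how expensive each backend append is; as the follow-up comment notes, in practice it turned out slower here.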
Ok so I tried that and while it worked nicely, it made it twice as slow ;) oops!
Just to give my opinion on what @esc is wondering: yes, my intention was to keep bcolz really minimal, and my hope was to find in PyToolz/CyToolz a nice counterpart for taking care of the analytics part, but my hopes really vanished as soon as you showed me that you can easily get a 10x increase in performance by implementing groupby internally in bcolz.

Frankly, I don't know where bcolz should go. My initial intention was indeed to keep it simple (I knew that I was not going to have too much time to put into it), but on the other hand performance is important as well, so perhaps integrating more functionality into it is good. The problem is, at which point should we decide to stop adding high-performance functionality? And, if we don't put a limit on it, are we perhaps bound to replicate pandas, but using carray/ctable bcolz objects? If so, would it not be better to make pandas work on top of bcolz?

Hmm, we are facing an important decision here, and I don't have all the answers. For the time being, though, I would like to move this discussion to the mailing list, and perhaps involve people from pandas and other related projects to participate and express their thoughts there. At any rate, I really like that this discussion is happening and encourage you to continue providing suggestions on how bcolz can be improved.
Hi Francesc (A. :) I have some ideas, I will put them in an e-mail to you and Valentin later this week! It's probably much better if you involve Jeff Reback etc., as they only know me from posting obscure Pandas & Pytables bugs. On the groupby, we're getting really, really close. I've extended/adjusted @FrancescElies's code to create a groupby function that can work over multiple columns and also cache factorizations. See: https://github.com/FrancescElies/bcolz/commits/groupsort
There's room for improvement left of course, ideally with a metadata layer for each column where I can write two carrays (for the factorized index and for the unique values). Also, numexpr cannot handle unsigned int64, unfortunately. But hey, it works! :) It's still slow compared to Pandas however, so more optimizations are required, but I would say: let's make something that works from start to end first, then look at bottlenecks and optimize.
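One way a multi-column groupby can reduce several factorized columns to a single key (a sketch of the general technique, not necessarily what the linked branch does) is to mix the per-column labels into one composite label, using each column's cardinality as the radix:

```python
import numpy as np

def combine_labels(label_arrays, cardinalities):
    # mixed-radix combination: treat each column's labels as one digit,
    # with that column's cardinality as the base
    combined = np.zeros_like(label_arrays[0])
    for labs, card in zip(label_arrays, cardinalities):
        combined = combined * card + labs
    return combined

# two columns with 2 unique values each -> 4 possible composite keys
a = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
key = combine_labels([a, b], [2, 2])  # [0, 1, 2, 3]
```

The composite key can then be factorized once more (or used directly) to aggregate over all groupby columns at the same time; the caveat is that the product of cardinalities must fit in the integer type used.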
I'll answer in-line to the above using my mailer. Also, I would tend to agree with @FrancescAlted to move any more lengthy discussions to the mailing lists. That way we can use the threading inherent to email in case we need to spin off tangents; the github interface really sucks at this kind of stuff and isn't really suited for more than a few simple replies back and forth.
Hi,
Yes, perhaps the ctable needs to move then too? Or everything gets
Hehe, yes.
True. Requires the klib stuff though if it wants to be fast.
Yeah, that sucks, maybe we need to hash the contents somehow to a file
Absolutely. Modular design does have its benefits. But drawbacks too,
That would be nice. I have a bloscpack S3 backend in the mental
Well, the pxd file would be awesome for sure, the cython code seems a
How do you want me to integrate this? Can I cherry-pick the commit and best, V-
Yeah, I am not sure what is going on there. If the thing to append is a
Yes, we can't put pandas on top of bcolz. Things aren't modular enough
This looks really promising! I would like to profile and optimize this!
Hi Esc, let us first wrap up the commit into a nice, quickly performing whole (Friday, I really hope), and then you can decide how to pick (and what should and should not be in bcolz).
```cython
else:
    # allocate enough memory to hold the string, add one for the
    # null byte that marks the end of the string.
    insert = <char *>malloc(allocation_size)
```
Not sure if we need to check for a NULL pointer here. I never did grok cython's malloc wrapper (but IIRC yes, we need to check it).
@esc & @CarstVaartjes, about integrating the tests, I have really no preferences. In my opinion you could do it the way that fits best for you: cherry-picking, copy-paste...
Okay, I saw that my own pull-request is still a bit messy and I'm not sure it will be the pull-request that ends up in the code base proper. However, I'd like to start thinking about how to structure the individual pull-requests for the groupby, so that they are as easy to review as possible, i.e. bite-sized chunks (no pun intended). For example, from this pull-request, the klib integration (whatever form that might take) is one thing. The factorize code (in all its variants, and for all types) is another.
As somebody told me today, I should not be doing this at this moment :), anyway... After trying out quite a lot of things, working hard with @CarstVaartjes, using @esc's code to try to produce similar things, and having discussions around nice ideas from @esc & @CarstVaartjes for a groupby function, in the end the problems were more or less successfully tackled. We are aware that this is not finished at all: it can still be optimised, better structured, code replication removed, and of course it needs further testing. Nevertheless, we would like to generate debate around it and see if in the end we can get something nice ;)

Small summary of the actual groupby implementation:
The following file https://github.com/FrancescElies/bcolz/blob/groupby/test_speed/groupby_vs_pandas.py shows how the groupby function is currently being called; it runs a small benchmark and compares the execution time versus pandas (code based on @FrancescAlted's sum script #78 (comment)). The correctness of the values needs visual inspection, though.

Benchmark groupby: suppose we want to group the following dataframe and aggregate the sum of the different groups (groupby columns a1, a2, a3 & a4):

```python
N = int(1e6)
projects = [
    {'a1': 'build roads', 'a2': 1.1, 'a3': 2, 'a4': 2000000000, 'm1': 1, 'm2': 2, 'm3': 3.2},
    {'a1': 'fight crime', 'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1': 2, 'm2': 3, 'm3': 4.1},
    {'a1': 'help farmers', 'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1': 4, 'm2': 5, 'm3': 6.2},
    {'a1': 'help farmers', 'a2': 1.1, 'a3': 3, 'a4': 3000000000, 'm1': 8, 'm2': 9, 'm3': 10.9},
    {'a1': 'build roads', 'a2': 1.1, 'a3': 3, 'a4': 3000000000, 'm1': 16, 'm2': 17, 'm3': 18.2},
    {'a1': 'fight crime', 'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1': 32, 'm2': 33, 'm3': 34.3},
    {'a1': 'help farmers', 'a2': 1.2, 'a3': 2, 'a4': 2000000000, 'm1': 64, 'm2': 65, 'm3': 66.6},
    {'a1': 'help farmers', 'a2': 1.3, 'a3': 3, 'a4': 3000000000, 'm1': 128, 'm2': 129, 'm3': 130.9}
]
df = pd.DataFrame(projects * N)
```

```
> python test_speed/groupby_vs_pandas.py
reference
                                        m1         m2            m3
a1           a2  a3  a4
build roads  1.1 2   2000000000    1000000    2000000  3.200000e+06
                 3   3000000000   16000000   17000000  1.820000e+07
fight crime  1.2 1   1000000000   34000000   36000000  3.840000e+07
help farmers 1.1 3   3000000000    8000000    9000000  1.090000e+07
             1.2 1   1000000000    4000000    5000000  6.200000e+06
                 2   2000000000   64000000   65000000  6.660000e+07
             1.3 3   3000000000  128000000  129000000  1.309000e+08
--> Pandas groupby 1.396 sec
--> Bcolz caching 1.801 sec
[('build roads', 1.1, 2, 2000000000, 1000000, 2000000, 3200000.0000426522)
 ('fight crime', 1.2, 1, 1000000000, 34000000, 36000000, 38400000.00080768)
 ('help farmers', 1.2, 1, 1000000000, 4000000, 5000000, 6200000.0001121415)
 ('help farmers', 1.1, 3, 3000000000, 8000000, 9000000, 10900000.000203883)
 ('build roads', 1.1, 3, 3000000000, 16000000, 17000000, 18199999.99965895)
 ('help farmers', 1.2, 2, 2000000000, 64000000, 65000000, 66600000.0010485)
 ('help farmers', 1.3, 3, 3000000000, 128000000, 129000000, 130900000.00236544)]
--> Bcolz groupby 2.333 sec
1.671 x times slower than pandas (cache time not taken into account)
```

The profiler looks like this (sum and looping seem to be consuming most of the time):
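For context, the factorize-then-aggregate approach benchmarked above boils down to something like this once the groupby columns are reduced to integer labels (a simplified numpy sketch, not the actual cython code): a per-group sum is just a weighted bincount over the labels.

```python
import numpy as np

def groupby_sum(labels, measure, n_groups):
    # sum the measure column per group: bincount over the integer
    # labels, weighted by the measure values
    return np.bincount(labels, weights=measure, minlength=n_groups)

# toy data: 5 rows falling into 3 groups
labels = np.array([0, 1, 0, 2, 1])
m1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sums = groupby_sum(labels, m1, 3)  # group sums: [4.0, 7.0, 4.0]
```

The cython code in the branch loops explicitly instead of calling bincount (so it can aggregate several measure columns in one pass over carray chunks), but the arithmetic per group is the same.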
The code can be found at https://github.com/FrancescElies/bcolz/tree/groupby Again a long post; it can't be helped, sorry :) @CarstVaartjes: if I am missing something, please feel free to add. @esc & @FrancescAlted: thanks a lot for keeping on reading & responding to our never-ending questions.
Maybe it's good to summarize it:
Please let us know how you guys want to proceed from here (integrate it, have it as a separate function, the relationship to pandas, blaze, etc.) -> shall I send a mail to the list for a discussion?
Thanks, Valentin, for the summary, because I was a bit lost with so many PRs. Thanks again!

2014-10-25 11:49 GMT+02:00 Carst Vaartjes notifications@github.com:
Francesc Alted
Actually, both #78 and #76 can be closed now; this pull as well, I think -> once you have decided how you want to integrate the functionality, we will prepare that and make an entirely new pull.
My apologies if I open too many tickets; the open source field is new to me.
What should we do with this issue? Perhaps in the future we can use the mailing list for any architectural and planning discussions and reserve issues for tracking bugs and fixes.
Good point, this got a bit more out of hand than we thought! Let's do the rest through the mailing list. In short: it works (we've gone into production with the BCOLZ-based groupby, it works really well!) and it's built of the following 3 blocks:
Let's see how we can do that there.
This one can be closed btw, as it's covered in bquery now.
Sure.