This repository has been archived by the owner on Dec 11, 2023. It is now read-only.

Consider adding a Categorical datatype to bcolz #66

Open
esc opened this issue Oct 10, 2014 · 11 comments · May be fixed by #187
Comments

@esc
Member

esc commented Oct 10, 2014

To be inspired by:

https://pandas-docs.github.io/pandas-docs-travis/categorical.html
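For reference, a minimal sketch of what the pandas categorical type provides (this is plain pandas, not bcolz code): it stores small integer codes plus a lookup table of unique categories.

```python
import pandas as pd

# A categorical stores integer codes plus a table of unique categories;
# by default the categories are the sorted unique values.
cat = pd.Categorical(["a", "b", "a", "c", "a"])
print(cat.codes.tolist())    # [0, 1, 0, 2, 0]
print(list(cat.categories))  # ['a', 'b', 'c']
```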

@CarstVaartjes
Contributor

That's sort of what the factorization does btw (though there are no special functions for categorical access yet)

@esc
Member Author

esc commented Dec 30, 2014

Yes, the categorical is the result of a factorization. IIRC in R a Categorical type is actually called factor.

@ARF1

ARF1 commented Apr 18, 2015

I would be quite interested in this as well.

Rationale: reading bcolz string columns into pandas is currently extremely slow, as each entry in the string column is converted to a Python string object. A categorical type also sounds like a tractable solution to #174.

It does not appear that a fixed-length pandas dtype is likely to be available any time soon, since the work required seems to be non-negligible: pandas-dev/pandas#5261

@mrocklin is working around this deficiency in his dask.dataframe project by using the pandas 'category' dtype.

The downside to this solution is that when reading a bcolz string column, the column has to be factorized for every read. It would be nice if the factorization could be stored with the data.
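The factorize-on-every-read step looks roughly like this (a sketch using pandas; `strings` stands in for a decompressed bcolz string column):

```python
import numpy as np
import pandas as pd

# Stand-in for a decompressed bcolz string column.
strings = np.array(["EUR", "USD", "EUR", "GBP", "USD"], dtype=object)

# factorize() returns integer codes plus the table of unique values;
# storing both alongside the column would avoid redoing this on each read.
codes, uniques = pd.factorize(strings)
print(codes)    # [0 1 0 2 1]
print(uniques)  # ['EUR' 'USD' 'GBP']
```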

The visualfabriq/bquery project has the "machinery" required for a categorical dtype in place. As mentioned, the API for categorical access is still missing.

How should the access functions behave?

My gut says:

  • keep the factorization (and un-factorization) machinery external (use pandas, bquery, or similar)
  • getitem:
    • for numpy output type: return the factor & provide a new function to access the category data
    • for a (new) pandas output type: return a dataframe / series with pandas dtype 'category'
  • setitem: similarly

Does this sound like a sensible idea?

Is anybody else interested in this issue as well?

@mrocklin
Contributor

There is a trade-off here. I think that categoricals are fantastic. However, I also appreciate that bcolz has a simple data model that exactly mirrors NumPy.

I like the bquery approach of building projects on top of bcolz because it allows bcolz itself to remain very simple.

@esc
Member Author

esc commented Apr 19, 2015

I think the categorical type can be implemented as an additional type in an external library; given the Cython API, this should be possible.

Some notes on the implementation:

The performance of a categorical type depends very much on the "load factor", i.e. the ratio of unique values to the total number of elements. This influences how to store it both in-memory and out-of-core.
I am not yet sure how to store the categories, or whether we should have a two-way lookup.

Depending on the number of categories, it may be wise to deactivate the shuffle filter. For example, if you "only" have < 256 categories, it may or may not make any difference. Anything beyond that depends on both the entropy and the order of the data. For example, if you have 257 categories but the element that maps to integer 256 appears only once in your >> 257 elements dataset, using the shuffle filter should give some nice compression ratios. If, on the other hand, your corresponding values are sampled uniformly from the full 0-65535 range of a 16-bit integer, then the shuffle filter is unlikely to give you much advantage.
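To make the cardinality point concrete: the width of the code dtype, and hence whether byte shuffling can even matter, follows directly from the number of categories. A small helper sketch (`code_dtype` is a hypothetical name, not bcolz code):

```python
import numpy as np

def code_dtype(n_categories):
    # Smallest unsigned integer type that can hold all the codes.
    if n_categories <= 2**8:
        return np.uint8   # one byte per element: shuffle filter is a no-op
    if n_categories <= 2**16:
        return np.uint16  # shuffle may help, depending on entropy and order
    return np.uint32

print(code_dtype(256).__name__)  # uint8
print(code_dtype(257).__name__)  # uint16
```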

@CarstVaartjes
Contributor

I would like to have a categorical type too; of course it's up to @FrancescAlted and @esc, but I think in light of previous discussions, we can always add it to bquery (to keep bcolz focused, plus it's exactly the kind of thing that we're trying to add with bquery).
Of course, bquery needs some general love too at the moment (making it pip-installable, parallelization, additional statistical methods, sort methods, etc.), but I think most of the building blocks for a categorical type are there.
Having read the pandas categorical doc, I'm a bit surprised at some of the functionality:

  • The fact that certain operations are only possible when the category is ordered (which most of the time you do not want, as the column is often part of a larger table)
  • How they set rows to NaN -> for me / bquery, NaN is currently just another factorization value.

I need to read more into R's factor too, I guess :)

For now it would mean:

  • Add the categorical type (and maybe default to this for string columns?)
  • Add better metadata to the carray (currently we do this a bit "dirty", without writing proper metadata, and we would need to start doing this in a more future-proof way); also: at the moment bquery factorization is heavily oriented towards the disk-based version only
  • Auto-factorize categorical columns while creating them or appending data to an existing column (which also needs checks such as "is the dtype the same as that of the categorical carray")
  • When adding statistical methods, check whether it's a categorical column and if so, handle it differently

Some reads that might be interesting for @ARF1 and @mrocklin, to understand why we made bquery in the first place. The original discussion document from which we started bquery: http://www.slideshare.net/cvaartjes/bcolz-groupby-discussion-document
And how my own company uses it (I hope to open-source this by the end of this year; the main thing there is to disentangle proprietary stuff from the general architecture):
http://www.slideshare.net/cvaartjes/bquery-architecture

@ARF1

ARF1 commented Apr 19, 2015

This issue would probably benefit from being considered together with the possibility of introducing a pandas out_flavor to ctable (though not necessarily for carray). (See #176)

A pandas out_flavor would not only offer good support for categoricals but also improve ctable performance significantly.

@mrocklin I appreciate the rationale for staying with the numpy data model, but the possible performance improvement with a pandas out_flavor merits consideration, I think. And if a pandas out_flavor were introduced for performance reasons, supporting categoricals would be a no-brainer, no?

@esc Going the bquery route with the "categoricals / pandas out_flavor" combination would be an option, but I think it would require reimplementing large sections of ctable. Integrating into bcolz looks like much less effort, since the "pandas DataFrame" and "numpy structured array" data access models are virtually identical. Only the small sections that instantiate the output structure appropriately would need changing (at first glance).

@CarstVaartjes We seem to have the same use case: financial data analysis. Is your analysis thus also predominantly along rather than across columns? How do you deal with the inefficiencies resulting from the row-major ordering of the structured array? Cheap computing power? Or are you filling your pandas DataFrames directly from carray and avoiding ctable altogether for data access?

@FrancescAlted
Member

Hi all,

Thanks for the detailed discussions. I am not able to dig into this a lot now, but I am planning to put more time into this in the near future. Just to be short: I like categoricals too, but they introduce complexity, and compression already alleviates situations where your cardinality is low, so that needs more discussion. I like the idea of using pandas as an additional flavor for outputting queries from bcolz, but we should explore whether a dictionary of numpy arrays could do the job too. If pandas can ingest dictionaries of numpy arrays cheaply, then that could be the way to go.
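Whether that ingestion is cheap can be probed directly (a quick sketch, not from the thread; the dict shape is a hypothetical query result):

```python
import numpy as np
import pandas as pd

# A hypothetical ctable query result as a dict of numpy arrays.
result = {"col1": np.arange(3), "col2": np.arange(3.0, 6.0)}

# pandas ingests this directly; note that it consolidates same-dtype
# columns into 2D "blocks", which typically implies one copy per array.
df = pd.DataFrame(result)
print(df.shape)  # (3, 2)
```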

I know that you are discussing other things too and I appreciate that, but just to be clear, I would like bcolz not to become the kind of monster that, for example, PyTables has become. Falling on the side of complexity sometimes might seem appealing, but we should fight hard against that temptation and try to keep things simple.

As I said, expect me to become more active here in the near future. Thanks.

@ARF1

ARF1 commented Apr 19, 2015

@FrancescAlted Thanks for taking the time to respond. I appreciate the desire to keep things simple and maintainable.

Categorical dtype issue:
As you suggested, I currently use compression to deal with low-cardinality string data. This works extremely well from a storage standpoint. Where it breaks down is in the interaction with pandas. Pandas does not really do strings. I thus now have two choices when interfacing with pandas:

  1. I factorize the string columns every single time I read from the ctable, which keeps complexity low but absolutely kills performance, or
  2. I use bquery factorization, avoid ctable __getitem__() altogether for reading the data, and load the factorization manually. This leads to messy code and is wasteful on storage, since I am carrying around the original data + the factor + the "index?"/"key?".

In short: to me, categoricals as they relate to bcolz are only really an issue with strings. Compression will take care of the other dtypes in my use cases. Maybe limiting categoricals to strings would change the complexity vs. utility analysis. I currently do not have a clear enough idea of how categoricals would be implemented to assess this.

Row-major ordering issue:
A dictionary of numpy arrays does indeed solve the row-major ordering issue. I did some tests before I discovered that pandas already uses column-major order. I did not propose it because it would change the data access model, and I had gotten the impression that that was a big no-go with bcolz:

Currently, projection to a subset of the columns is identical with numpy and ctable: mydata[['col1', 'col2']] does the trick. If ctable were to return a dict of numpy arrays, this would no longer be true:

In [9]: test = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}

In [10]: test[['col1', 'col2']]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-ac76cd049698> in <module>()
----> 1 test[['col1', 'col2']]

TypeError: unhashable type: 'list'

While a pandas out_flavor would preserve the API:

In [32]: test = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}

In [33]: test1 = pd.DataFrame(data=test)

In [34]: test1[['col1', 'col2']]
Out[34]:
   col1  col2
0     1     4
1     2     5
2     3     6

If that break in the API is not an issue, a dict of arrays would be great for pandas: the built-in functions can ingest them fairly cheaply.

"Fairly" cheaply because:

  1. it has the DataFrame instantiation overhead (which ctable could probably cache), and
  2. it copies the arrays into a newly created array (pandas collates columns with identical dtypes into something it calls "blocks").
    • The copying, however, will be efficient, as it will copy contiguous data.
    • Of course direct decompression into a pre-allocated DataFrame would be even faster but that is the complexity vs. maintainability tradeoff again. I cannot imagine the gains would be worth the effort: but profiling keeps surprising me...

@CarstVaartjes
Contributor

Hi @ARF1, about your questions: we also use columns predominantly (we do more FMCG & retail stuff than financial, but for example we have sets with 2 billion records of retail sales). However, we do not really run into major inefficiencies in bquery itself, as it really does per-column operations (see the slideshare presentation I mentioned).
We do use pandas too (for some of the nicer functionality in there, as well as statsmodels & scikit-learn), but there we use the ctable todataframe functionality. As far as I can see, it already translates nicely and directly to a dataframe? ->

def todataframe(self, columns=None, orient='columns'):

@ARF1

ARF1 commented May 5, 2015

In PR #187 I propose a new abstraction layer for the generation of the "results array" in ctable.__getitem__() and access to it.

This would allow everybody to provide their own out_flavor implementations, e.g. with pandas or categorical dtype, while minimizing impact on bcolz code and maintainability.

Overhead is fairly low: 42.3µs vs. 38.5µs for returning a single-row result from my test data.

I would love to know what you think.
