I think we've discussed this, but I don't remember the conclusion and can't find an issue now.
We recommend `from_pandas` as the way "most users" should construct tabmat objects. `from_pandas` then guesses which columns should be treated as categorical. I think it would be really nice to have Patsy-like formulas as an alternative, since

- R users (including many economists) like using formulas, and
- it's easy to infer from a Patsy formula which columns are categorical, which are sparse (generally interactions with categoricals), and which are dense (everything else), so this could remove some of the guesswork from tabmat and improve performance.
I'm not sure how feasible this would be, since Patsy is a sizable library that allows for fairly sophisticated formulas, and it would be quite an endeavor to replicate all of its functionality. A few ways of doing this would be:

1. Don't change any code, but document how Patsy can already be used to construct a dataframe that can then be passed to tabmat / glum. Warn that this involves creating a large dense matrix as an intermediate. See Twitter discussion: https://twitter.com/esantorella22/status/1447980727820296198
2. Have tabmat call `patsy.dmatrix` with `return_type="dataframe"`, then call `tabmat.from_pandas` on the resulting `pd.DataFrame`. That would not be any more efficient than (1), but would save the user a little typing and the need to install patsy themselves. On the downside, it adds a dependency and may force creation of a very large dense matrix.
3. Support very simple Patsy-like formulas without having patsy as a dependency or reproducing its full functionality. That would allow the user to designate which columns should be treated as categorical in a more natural way. See Twitter discussion: https://twitter.com/esantorella22/status/1447981081358184461
4. Make it so that any Patsy formula can be used to create a tabmat object. I'm not sure how; it might be hard.
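To make the "very simple Patsy-like formulas" option a bit more concrete, here is a minimal sketch of what such a parser might look like. Everything in it is an assumption for illustration: `split_formula_terms` is a hypothetical helper (not part of tabmat or patsy), and it only understands additive terms that are either a bare column name or `C(column)`. Interactions, intercept handling, and transformations are deliberately out of scope.

```python
import re

# Hypothetical helper for illustration only; not part of tabmat or patsy.
# Supported grammar: "[lhs ~] term + term + ...", where each term is either
# a bare column name or C(column_name).
_CAT_TERM = re.compile(r"^C\(\s*([A-Za-z_]\w*)\s*\)$")
_NAME = re.compile(r"^[A-Za-z_]\w*$")


def split_formula_terms(formula):
    """Return (numeric_columns, categorical_columns) named in the formula."""
    # Drop an optional left-hand side such as "y ~".
    rhs = formula.split("~", 1)[-1]
    numeric, categorical = [], []
    for raw in rhs.split("+"):
        term = raw.strip()
        match = _CAT_TERM.match(term)
        if match:
            # C(col) explicitly marks the column as categorical.
            categorical.append(match.group(1))
        elif _NAME.match(term):
            # A bare name is treated as a numeric (dense) column.
            numeric.append(term)
        else:
            # Anything else (interactions, functions, "-1", ...) is rejected.
            raise ValueError(f"unsupported term: {term!r}")
    return numeric, categorical


numeric_cols, categorical_cols = split_formula_terms("y ~ age + C(region) + income")
print(numeric_cols, categorical_cols)
```

One possible use, again only as an assumption about how this could be wired up: cast the columns in `categorical_cols` to pandas' `category` dtype before passing the dataframe to `from_pandas`, so the user's formula decides what is categorical instead of a guess.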
I like the idea, but want to add a word of caution from my previous experience using patsy. Patsy seems to be focused on non-regularized models. For instance, it's rather cumbersome to specify a one-hot-encoded variable in patsy without dropping a column. I'm sure we could adapt patsy to our needs though.
While thinking about this, I found https://github.com/matthewwardrop/formulaic, which seems to fix some of patsy's issues and would be easier to integrate into tabmat (since it has sparse matrix support built in).
PR #267 proposes a formulaic-based formula interface for tabmat, and Glum PR #670 does the same downstream in glum. Any comments or suggestions are much appreciated :)