Support initializing matrices with Patsy? #145

Closed
esantorella opened this issue Oct 12, 2021 · 4 comments

@esantorella
Contributor

I think we've discussed this, but I don't remember the conclusion and can't find an issue now.

We recommend from_pandas as the way "most users" should construct tabmat objects. from_pandas then guesses which columns should be treated as categorical. I think it would be really nice to have Patsy-like formulas as an alternative, since

  1. R users (including many economists) like using formulas, and
  2. It's easy to infer from a Patsy formula which columns are categorical, which are sparse (generally interactions with categoricals), and which are dense (everything else), so this could remove some of the guesswork from tabmat and improve performance.

I'm not sure how feasible this would be, since Patsy is a sizable library that allows for fairly sophisticated formulas, and replicating all of its functionality would be quite an endeavor. A few ways of doing this:

  1. Don't change any code, but document how Patsy can already be used to construct a dataframe that can then be passed to tabmat / glum. Warn that this involves creating a large dense matrix as an intermediate. See Twitter discussion: https://twitter.com/esantorella22/status/1447980727820296198
  2. Have tabmat call patsy.dmatrix with return_type="dataframe", then call tabmat.from_pandas on the resulting pd.DataFrame. That would be no more efficient than (1); it would only save the user a little typing and a separate patsy install. On the downside, it adds a dependency and may force creation of a very large dense matrix.
  3. Support very simple patsy-like formulas without having patsy as a dependency or reproducing its full functionality. That would allow the user to designate which columns should be treated as categorical in a more natural way. See Twitter discussion: https://twitter.com/esantorella22/status/1447981081358184461
  4. Make it so that any Patsy formula can be used to create a tabmat object -- I'm not sure how. Might be hard.
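
To make option 3 concrete, here is a minimal sketch of what "very simple patsy-like formulas" could mean, using only the standard library. The names `parse_simple_formula` and `ParsedFormula` are hypothetical and not part of tabmat or patsy; the sketch assumes that only `+`-separated main effects and the `C(...)` marker are supported, and rejects everything else.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedFormula(NamedTuple):
    """Result of parsing a minimal patsy-like formula (hypothetical type)."""

    response: Optional[str]
    categorical: List[str]
    numeric: List[str]


# C(column) is the patsy spelling for "treat this column as categorical".
_CATEGORICAL_TERM = re.compile(r"C\(\s*([A-Za-z_]\w*)\s*\)")
_PLAIN_TERM = re.compile(r"[A-Za-z_]\w*")


def parse_simple_formula(formula: str) -> ParsedFormula:
    """Split 'y ~ x1 + C(x2)' into response, categorical and numeric columns.

    Interactions, transforms and intercept control are rejected with a
    ValueError rather than silently ignored.
    """
    if "~" in formula:
        lhs, rhs = formula.split("~", 1)
        response = lhs.strip() or None
    else:
        response, rhs = None, formula

    categorical: List[str] = []
    numeric: List[str] = []
    for raw_term in rhs.split("+"):
        term = raw_term.strip()
        match = _CATEGORICAL_TERM.fullmatch(term)
        if match:
            categorical.append(match.group(1))
        elif _PLAIN_TERM.fullmatch(term):
            numeric.append(term)
        else:
            raise ValueError(f"Unsupported term in formula: {term!r}")
    return ParsedFormula(response, categorical, numeric)


parsed = parse_simple_formula("y ~ x1 + C(region) + x2")
```

The resulting lists could then be used to decide which columns to cast to a categorical dtype before calling from_pandas, so nothing has to be guessed.
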
@MarcAntoineSchmidtQC
Member

MarcAntoineSchmidtQC commented Oct 12, 2021

I like the idea, but just want to add a word of caution from my previous experience using patsy. Patsy seems to be focused on non-regularized models. For instance, it's rather cumbersome to specify a one-hot-encoded variable in patsy without dropping a column. I'm sure we could adapt patsy to our needs though.
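
A small illustration of the column-dropping behavior, assuming pandas and patsy are installed; the toy data frame and the column name `g` are made up for this example.

```python
import pandas as pd
import patsy

df = pd.DataFrame({"g": ["a", "b", "c", "a"]})

# Default treatment coding: an intercept plus k - 1 dummy columns,
# so the reference level "a" gets no column of its own.
dropped = patsy.dmatrix("C(g)", df, return_type="dataframe")

# Removing the intercept makes patsy keep all k levels of this term.
full = patsy.dmatrix("0 + C(g)", df, return_type="dataframe")

print(list(dropped.columns))
print(list(full.columns))
```

The `0 +` trick only yields a full one-hot encoding for one categorical term; with a second categorical in the formula, patsy again drops a level to avoid redundancy, which is what makes this awkward for regularized models.
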

While thinking about this, I found https://github.com/matthewwardrop/formulaic, which seems to fix some of patsy's issues and would be easier to integrate into tabmat (since it has sparse matrix support built in).

@lorentzenchr

FYI, patsy has issues with pickling; see pydata/patsy#26.

@MartinStancsicsQC
Contributor

PR #267 proposes a formulaic-based formula interface for tabmat, and Glum PR #670 does the same downstream in glum. Any comments or suggestions are much appreciated :)

MatthiasSchmidtblaicherQC added this to the Tabmat v4 milestone on Feb 6, 2024
@MatthiasSchmidtblaicherQC
Contributor

Addressed by #286.
