
Interactions #583

Closed
mayer79 opened this issue Nov 15, 2022 · 6 comments · Fixed by #677


mayer79 commented Nov 15, 2022

Fantastic project.

I would love to see the possibility to add interactions on the fly, just like in H2O. There, you can provide a list of interaction pairs or, alternatively, a list of columns for which all pairwise interactions are created.

This would be especially useful because scikit-learn preprocessing does not make it easy to create dummy encodings for a categorical X and then calculate their product with another feature. (At least not with neat code.)
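To illustrate the pain point, here is a minimal sketch of the manual workaround in plain pandas (the column names cat and x are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "a", "c"], "x": [1.0, 2.0, 3.0, 4.0]})

# One-hot encode the categorical, then multiply each dummy column
# by the numeric feature to obtain the interaction terms.
dummies = pd.get_dummies(df["cat"], prefix="cat", dtype=float)
interactions = dummies.mul(df["x"], axis=0)
interactions.columns = [f"{c}:x" for c in dummies.columns]
```

This works, but it materializes a dense block of mostly-zero columns and has to be repeated by hand at predict time.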

@lorentzenchr
Contributor

@tbenthompson @lbittarello @jtilly Is there any official statement concerning this feature?

From my perspective, the inability to specify interaction terms is the largest blind spot of production-grade GLMs in Python.

@lbittarello
Member

@MartinStancsicsQC is looking into it in the context of this PR in tabmat. :)

@MartinStancsicsQC
Contributor

Hey @mayer79, @lorentzenchr, I'd be very interested to hear whether the formula interface proposed in #670 would fit your use cases for specifying interactions. You can also find some info in this tutorial, in addition to the PR itself.


mayer79 commented Aug 2, 2023

I 👍 this. The questions are: Is it efficient? (Interactions with dummies generate many zeros.) And: Is it safe to load a serialized model and use it to predict on unseen data?

@MartinStancsicsQC
Contributor

Good points. It should be efficient. For example, in the case of categorical-categorical interactions, the categories are never actually expanded to dummies. The new (categorical) variable representing the interaction is created directly from the category codes.1
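A hedged sketch of the idea (not tabmat's actual implementation): the interaction of two categoricals can be built directly from their integer codes, without ever materializing dummy columns:

```python
import numpy as np
import pandas as pd

c1 = pd.Categorical(["a", "b", "a", "b"])
c2 = pd.Categorical(["x", "x", "y", "y"])

# Combine the integer codes of the two categoricals into a single
# code for the interaction level; no dummies are created.
n2 = len(c2.categories)
combined_codes = c1.codes.astype(np.int64) * n2 + c2.codes
levels = [f"{a}:{b}" for a in c1.categories for b in c2.categories]
interaction = pd.Categorical.from_codes(combined_codes, categories=levels)
```

The result is again a categorical, so it can be stored in the same memory-efficient representation as its inputs.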

And yes, the model remains pickleable (there is a test for this on the tabmat side), and also keeps track of categorical levels2 so it can still predict correctly if there are missing/unseen levels in the new data.
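The level-tracking idea can be illustrated with plain pandas (this shows the general principle, not glum's internals): casting new data with the categories fixed at training time keeps the codes stable and flags unseen levels instead of silently shifting the encoding.

```python
import pandas as pd

train_levels = ["a", "b", "c"]  # levels observed at fit time

# New data contains an unseen level "d"; fixing the categories maps
# it to code -1 (NaN) rather than reordering the known codes.
new = pd.Categorical(["a", "d", "c"], categories=train_levels)
```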


1: More generally, we are not doing a multi-step pandas.DataFrame → (formulaic.model_matrix) → pandas.DataFrame → (tabmat.from_pandas) → tabmat.MatrixBase pipeline. Instead, a custom formulaic materializer subclass (tabmat.TabmatMaterializer) converts the pandas.DataFrame to a tabmat.MatrixBase directly, utilizing tabmat's strengths.

2: This feature is also a bit more general and works with a number of stateful transformations. E.g., if you use the scale function in a formula to normalize your predictors and then predict on new data, the new data will be normalized based on the mean and variance of the training data.
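The same stateful fit/transform pattern, sketched with scikit-learn's StandardScaler as a stand-in for a stateful scale(...) term in a formula: statistics are learned on the training data and reused, unchanged, on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

# fit() remembers the training mean and variance; transform() on new
# data reuses those statistics rather than recomputing them.
scaler = StandardScaler().fit(X_train)
z = scaler.transform(X_new)
```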


mayer79 commented Aug 3, 2023

Wow, thanks a lot for the explanations. Really looking forward to this!
