Replace combine_dfs operator functionality with sklearn's FeatureUnion #117
@teaearlgraycold, do you remember why we decided this wasn't feasible with FeatureUnion?
IIRC it only works on feature preprocessors. You can't pass it classifiers. Wherever it was, it should be apparent from the documentation.
Ah, right.
@amueller, any ideas on how to make it such that sklearn Classifiers can be included (as discussed above) in FeatureUnions?
Wait, it depends on what you want. For feature selection? Did I provide the example in the first post? Seems odd, lol.
Let's use this pipeline as an example, but with a Random Forest in place of the PCA. So we're looking for a sklearn-compatible way to represent that as a pipeline. We originally thought we could pass the Random Forest directly into a FeatureUnion, but I'm pretty sure that doesn't work out of the box.
That was the part I wasn't sure about. OK then.
How would that look in sklearn code? Like this?
Yeah, only that the RFE is around the last RandomForestClassifier.
I'm not sure what you mean?
Wait, why? Is that specific to RFE, or to feature selection methods in general?
RFE is model-based feature selection. How could it do feature selection without a model? SelectFromModel is also a meta-estimator, while the feature selection methods that are not model-based are not.
Btw, this is feature selection using a RF. That doesn't necessarily imply classification with a RF.
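To illustrate the distinction, here is a hedged sketch (not from the thread; dataset and parameters are illustrative) of SelectFromModel as a meta-estimator: it wraps an estimator and uses that estimator's feature_importances_ to decide which features to keep, without implying that the same model does the final classification.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# SelectFromModel keeps the features whose importance, as reported by the
# wrapped estimator, exceeds a threshold (the mean importance by default)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=10, random_state=0))
X_reduced = selector.fit_transform(X, y)
```

Any classifier, RF-based or not, can then be trained on `X_reduced`.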
So we could do something like that, if we wanted RF classification at the end of the pipeline.
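A hedged sketch of that shape (parameter values are illustrative, not from the thread): one Random Forest inside RFE ranks and eliminates features, and a second Random Forest classifies on the surviving features.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# RFE ranks features with one forest and keeps the top 2; the final
# forest then classifies using only those selected features
clf = make_pipeline(
    RFE(RandomForestClassifier(n_estimators=10, random_state=0),
        n_features_to_select=2),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
scores = cross_val_score(clf, X, y, cv=3)
```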
Ahhh, I get it now. Just looked at the docs for the RFE predict function. Basically, calling predict on an RFE wrapped around a RandomForestClassifier, and a pipeline of RFE feature selection followed by a RandomForestClassifier, would do the same thing.
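That equivalence can be checked directly. A sketch (dataset and parameters are illustrative; the random_state is fixed so both paths train identical forests): RFE.predict delegates to the estimator it refit on the selected features, which is exactly what the explicit two-step pipeline does.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# path 1: RFE alone -- its internal estimator is refit on the selected
# features, and predict() delegates to that refit estimator
rfe = RFE(RandomForestClassifier(n_estimators=10, random_state=0),
          n_features_to_select=2)
rfe.fit(X, y)

# path 2: RFE as a transformer, followed by an identically seeded forest
pipe = make_pipeline(
    RFE(RandomForestClassifier(n_estimators=10, random_state=0),
        n_features_to_select=2),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipe.fit(X, y)

# same selected features + same seed => identical predictions
assert np.array_equal(rfe.predict(X), pipe.predict(X))
```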
Yes.
That's awesome. Looks like we'll be able to export to sklearn pipelines after all, @teaearlgraycold!
Here's some example code that works:

```python
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.datasets import load_digits
# sklearn.cross_validation in the sklearn releases current at the time
from sklearn.model_selection import cross_val_score

data = load_digits()
clf = make_pipeline(
    make_union(PolynomialFeatures(),
               VotingClassifier(estimators=[('rf1', RandomForestClassifier())])),
    VarianceThreshold(),
    SelectKBest(k=5),
    RandomForestClassifier(),
)
cross_val_score(clf, data.data, data.target, cv=5)
```
I'll keep that in mind when I start working on the refactored code's export utils.
This feature will be in the 0.5 release. |
Currently, combine_dfs uses custom code to combine the features from separate pipelines into a single feature set. We should instead use sklearn's FeatureUnion within combine_dfs, which I believe will do a better and more efficient job of combining the features.
Here's an example provided by @amueller:
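The original example was attached as an image and isn't preserved here. As a hedged sketch of the underlying idea (illustrative transformers, not necessarily the ones from the original example): FeatureUnion fits several transformers on the same input and concatenates their outputs column-wise into a single feature matrix.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures

X, y = load_iris(return_X_y=True)

# each transformer sees the same input; their outputs are stacked
# side by side into one combined feature matrix
union = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('poly', PolynomialFeatures(degree=2)),
])
combined = union.fit_transform(X, y)
# 2 PCA components + 15 polynomial terms (degree <= 2 over 4 inputs)
print(combined.shape)  # (150, 17)
```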