
Replace combine_dfs operator functionality with sklearn's FeatureUnion #117

Closed · rhiever opened this issue Mar 20, 2016 · 23 comments


rhiever commented Mar 20, 2016

Currently, combine_dfs uses custom code to combine the features from separate pipelines into a single feature set. We should instead use sklearn's FeatureUnion class within combine_dfs, which I believe will do a better and more efficient job of combining the features.

Here's an example provided by @amueller:

make_pipeline(make_union(PolynomialFeatures(), PCA()), RFE(RandomForestClassifier()))

rhiever commented Jun 1, 2016

@teaearlgraycold, do you remember why we decided this wasn't feasible with FeatureUnion? We should document that here.

danthedaniel commented

IIRC it only works on feature preprocessors. You can't pass it classifiers.

Wherever that was decided, it should be apparent from the documentation.



rhiever commented Jun 1, 2016

Ah, right: FeatureUnion only accepts sklearn preprocessors that have a transform() function. So unless there's an easy way to wrap the sklearn classifiers such that they have a transform() function that simply adds the classifier's predictions as a new feature...
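
A wrapper like that would only be a few lines. Here's a minimal sketch; the ClassifierTransformer name and its behavior (exposing the wrapped classifier's predictions through transform()) are hypothetical, not anything sklearn ships:

from sklearn.base import BaseEstimator, TransformerMixin

class ClassifierTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: exposes a classifier's predictions via transform()."""

    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, y=None):
        self.classifier.fit(X, y)
        return self

    def transform(self, X):
        # Return predictions as a single-column feature matrix; FeatureUnion
        # would concatenate this column with the other branches' outputs.
        return self.classifier.predict(X).reshape(-1, 1)

# Usage sketch:
# make_union(PolynomialFeatures(), ClassifierTransformer(RandomForestClassifier()))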


rhiever commented Jun 1, 2016

@amueller, any ideas on how to make it such that sklearn Classifiers can be included (as discussed above) in FeatureUnions?


amueller commented Jun 2, 2016

Wait, it depends on what you want. For feature selection? Did I provide the example in the first post? Seems odd, lol


rhiever commented Jun 2, 2016

Let's use this pipeline as an example:

[pipeline diagram from the first post: PolynomialFeatures and PCA combined via make_union, feeding into RFE(RandomForestClassifier())]

but instead of PCA there's a Random Forest there. So what this pipeline does is:

  1. Take a copy of the data set and apply Polynomial Features to create data set A

  2. Take another copy of the data set, fit a Random Forest, and take the predictions of the Random Forest and add them to the data set as a new feature to create data set B

  3. Combine the features of data sets A and B into a single data set

  4. Apply RFE

  5. Fit another Random Forest to the features left after RFE and use that Random Forest's predictions as the final prediction for the pipeline

So we're looking for a sklearn-compatible way to represent that as a pipeline. We originally thought we could do:

make_pipeline(make_union(PolynomialFeatures(), RandomForestClassifier()), RFE(), RandomForestClassifier())

but I'm pretty sure that doesn't work out of the box.
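
For what it's worth, a quick way to confirm: in the sklearn versions I've tried, FeatureUnion validates at fit time that every step implements transform(), so a bare classifier in the union raises a TypeError:

from sklearn.pipeline import make_union
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
union = make_union(PolynomialFeatures(), RandomForestClassifier())
union.fit(X, y)  # TypeError: all estimators should implement fit and transform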


amueller commented Jun 2, 2016

> and take the predictions of the Random Forest

That was the part I wasn't sure about. OK then, VotingClassifier.


rhiever commented Jun 2, 2016

How would that look in sklearn code? Like this?

make_pipeline(make_union(PolynomialFeatures(), VotingClassifier(estimators=[('rf1', RandomForestClassifier())])), RFE(), RandomForestClassifier())


amueller commented Jun 2, 2016

Yeah, only that the RFE should go around the last RandomForestClassifier.


rhiever commented Jun 2, 2016

I'm not sure what you mean?


amueller commented Jun 2, 2016

make_pipeline(make_union(PolynomialFeatures(), VotingClassifier(estimators=[('rf1', RandomForestClassifier())])), RFE(RandomForestClassifier()))


rhiever commented Jun 2, 2016

Wait, why? Is that specific to RFE or feature selection methods?


amueller commented Jun 2, 2016

RFE is model-based feature selection. How should it do feature selection without a model? SelectFromModel is also a meta-estimator, while the feature selection methods that are not model-based are not.
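
To make the distinction concrete, here's a sketch of the three cases (the particular estimators are just illustrative choices):

from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Model-based: meta-estimators that need a model to rank features.
rfe = RFE(estimator=SVC(kernel='linear'))
sfm = SelectFromModel(estimator=RandomForestClassifier())

# Not model-based: scores each feature directly from the data, no model required.
skb = SelectKBest(score_func=f_classif, k=5)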


amueller commented Jun 2, 2016

Btw, this is feature selection using RF. That doesn't necessarily imply classification with RF.


rhiever commented Jun 2, 2016

So we could do something like:

make_pipeline(make_union(PolynomialFeatures(),
                         VotingClassifier(estimators=[('rf1', RandomForestClassifier())])),
              RFE(estimator=SVC(kernel='linear')),
              RandomForestClassifier())

if we wanted RF classification at the end of the pipeline.


amueller commented Jun 2, 2016

RFE has a predict, so both of the pipelines you outlined can predict. They do different things, though.
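
For reference, a minimal sketch of what RFE's predict does: it masks X down to the selected features and delegates to the estimator that was refit on them.

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_digits(return_X_y=True)
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10).fit(X, y)
preds = rfe.predict(X)  # reduces X to the 10 selected features, then calls the inner RF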


rhiever commented Jun 2, 2016

Ahhh, I get it now. Just looked at the docs for the RFE predict function. Basically:

make_pipeline(make_union(PolynomialFeatures(),
                         VotingClassifier(estimators=[('rf1', RandomForestClassifier())])),
              RFE(estimator=RandomForestClassifier()))

and

make_pipeline(make_union(PolynomialFeatures(),
                         VotingClassifier(estimators=[('rf1', RandomForestClassifier())])),
              RFE(estimator=RandomForestClassifier()),
              RandomForestClassifier())

would do the same thing.


amueller commented Jun 2, 2016

yes


rhiever commented Jun 2, 2016

That's awesome. Looks like we'll be able to export to sklearn pipelines after all, @teaearlgraycold!


rhiever commented Jun 2, 2016

Here's some example code that works:

from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in releases before 0.18

data = load_digits()

clf = make_pipeline(
    # Union of two branches: polynomial feature expansion, and a
    # VotingClassifier whose transform() emits its estimators' predictions.
    make_union(PolynomialFeatures(),
               VotingClassifier(estimators=[('rf1', RandomForestClassifier())])),
    VarianceThreshold(),  # drop zero-variance features from the union
    SelectKBest(k=5),     # keep the 5 highest-scoring features
    RandomForestClassifier())

cross_val_score(clf, data.data, data.target, cv=5)
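
This works because VotingClassifier, unlike a bare classifier, implements transform(): with the default voting='hard' it returns each sub-estimator's predictions as columns, so FeatureUnion accepts it and concatenates those predictions with the PolynomialFeatures output.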

danthedaniel commented

I'll keep that in mind when I start working on the refactored code's export utils


rhiever commented Aug 19, 2016

This feature will be in the 0.5 release.
