
TPOTEnsemble idea #479

Open
rhiever opened this issue Jun 1, 2017 · 9 comments

Comments

@rhiever
Contributor

rhiever commented Jun 1, 2017

Many people have been asking for a version of TPOT that creates ensembles of pipelines, as that's what often wins Kaggle competitions etc. We've created prototypes of TPOT that ensemble the Pareto front or final population, but those prototypes didn't work so well because TPOT pipelines are optimized to perform well on a dataset by themselves. In other words, there is no pressure from TPOT to create pipelines that work well with other pipelines.

Here's my proposal for allowing TPOT to create ensembles of pipelines: What if we treated the TPOT optimization procedure as a sort of boosting procedure? It could work as follows:

  1. Create the initial population (P0) and evaluate it on the dataset as normal.
  2. Take the best pipeline from P0 and put it into a VotingClassifier.
  3. Generate the next population (P1) using the normal fitness scores.
  4. When evaluating the individuals in P1, compute their fitness by evaluating them in the VotingClassifier together with the best pipeline from P0.
  5. Take the best pipeline from P1 and put it into the VotingClassifier with the best pipeline from P0.
  6. Generate the next population using these "ensemble fitness scores".
  7. Evaluate the pipelines in the new generation by evaluating them in a VotingClassifier with the best individuals from the previous generations.
  8. etc.

That way, TPOT is directly optimizing for pipelines that ensemble well with the previously-best pipelines, and the final ensemble is composed of one pipeline from each generation. Is this idea crazy enough to work?
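The loop above can be sketched with plain scikit-learn. This is only an illustration, not TPOT's actual API: the hand-written `generations` list stands in for populations produced by TPOT's GP operators, and `ensemble_fitness` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

def ensemble_fitness(candidate, kept):
    """Score a candidate inside a VotingClassifier with the kept pipelines."""
    ens = VotingClassifier(estimators=kept + [("cand", candidate)])
    return cross_val_score(ens, X, y, cv=3).mean()

kept = []  # one "best" pipeline per generation (steps 2 and 5)
generations = [  # stand-in for populations produced by GP operators
    [LogisticRegression(max_iter=500), DecisionTreeClassifier(random_state=0)],
    [KNeighborsClassifier(n_neighbors=3), KNeighborsClassifier(n_neighbors=7)],
]
for gen, population in enumerate(generations):
    if kept:  # steps 4 and 7: fitness is the ensemble's CV score
        scores = [ensemble_fitness(p, kept) for p in population]
    else:     # step 1: evaluate the initial population as normal
        scores = [cross_val_score(p, X, y, cv=3).mean() for p in population]
    kept.append((f"gen{gen}", population[scores.index(max(scores))]))

# The final ensemble holds one pipeline from each generation.
final_ensemble = VotingClassifier(estimators=kept).fit(X, y)
```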

@rhiever
Contributor Author

rhiever commented Jun 2, 2017

I made a hacky demo of the TPOTEnsemble idea in this commit.

It seemed to work fine in my tests, although it gets much, much slower as the generations pass because, e.g., by generation 100 every pipeline is being evaluated in a VotingClassifier with 99 other pipelines. The only reasonable solution seems to be to store the predictions of each "best" pipeline from every generation, and manually ensemble those predictions with the new predictions from the pipelines in the current generation.

Of course, there's no way around storing the entire pipeline list in a VotingClassifier for making new predictions in the TPOT predict and score functions. But at least the above solution avoids re-evaluating the same list of pipelines over and over again.
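The prediction-caching idea could look like the following sketch: store each past best pipeline's out-of-fold class probabilities once, then score a new candidate by averaging its probabilities with the cached ones (a manual soft vote). The function names here are hypothetical, not part of TPOT:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

cached_proba = []  # one (n_samples, n_classes) array per past best pipeline

def cache_best(pipeline):
    """Store a best pipeline's out-of-fold probabilities once, up front."""
    cached_proba.append(
        cross_val_predict(pipeline, X, y, cv=3, method="predict_proba"))

def cached_ensemble_accuracy(candidate):
    """Fit only the candidate; soft-vote against the cached predictions."""
    cand = cross_val_predict(candidate, X, y, cv=3, method="predict_proba")
    avg = np.mean(cached_proba + [cand], axis=0)  # average the probabilities
    return float(np.mean(avg.argmax(axis=1) == y))

cache_best(LogisticRegression(max_iter=500))  # e.g. generation 0's best
acc = cached_ensemble_accuracy(DecisionTreeClassifier(random_state=0))
```

This way each generation's evaluation cost stays constant instead of growing with the number of stored pipelines.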

@reiinakano

Check this out: scikit-learn/scikit-learn#8960

In the next release, scikit-learn is probably going to get an implementation of stacking classifier, so TPOT might be able to search stacked ensembles the same way it searches pipelines.
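For reference, `StackingClassifier` did eventually land in scikit-learn (version 0.22). A minimal sketch of one stacked ensemble that such a search could produce, with arbitrary toy base estimators:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# The base estimators' out-of-fold predictions feed a final meta-estimator.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=500),
    cv=3,
).fit(X, y)
```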

@rhiever
Contributor Author

rhiever commented Jun 8, 2017

Awesome. I look forward to the next release, then!

@simonzcaiman

An ensemble of pipelines would be a great improvement for TPOT!
Would it also be better to allow stacking-model selection? For example, if one does not want to use a VotingClassifier as the stacking model, could another TPOT pipeline optimization be used to choose the best stacking model?

@rhiever
Contributor Author

rhiever commented Jun 14, 2017

@simonzcaiman, this is certainly something we should discuss now before we move forward with the actual implementation of TPOTEnsemble. It seems like a good idea to allow different ensemble methods, but I only know of the ones in VotingClassifier from sklearn. Are there other ensemble methods (preferably with a sklearn-like interface) that we should be aware of?

@sashml

sashml commented Jun 19, 2017

Are there other ensemble methods (preferably with a sklearn-like interface) that we should be aware of?

Not sure if you should, but Sebastian Raschka has his own StackingRegressor in mlxtend: https://rasbt.github.io/mlxtend/user_guide/regressor/StackingRegressor/

@rhiever
Contributor Author

rhiever commented Jul 17, 2017

Dropping an idea here while it's on my mind: Maybe the original approach to TPOTEnsemble is not good because it requires too many expensive evaluations every generation. Perhaps a better approach would be similar to what @lacava does in FEW:

  1. Take the entire TPOT population and stack the pipelines' outputs into a feature matrix.
  2. Fit a regularized linear model (preferably Lasso) on the feature matrix.
  3. Use the linear model's coefficients as the fitness of each pipeline.

After the first generation, all pipelines with a 0 coefficient will be removed from the TPOT ensemble.

At generation 1 (and beyond), all pipelines in the new population will be added to the TPOT ensemble along with the surviving pipelines currently in the TPOT ensemble. Stack all of the outputs, fit a regularized linear model, and again use the coefficients as the fitness.
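A sketch of this FEW-style scheme under toy assumptions (a regression problem and a hand-picked population standing in for evolved TPOT pipelines):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

population = [LinearRegression(), Ridge(alpha=1.0),
              DecisionTreeRegressor(max_depth=2, random_state=0)]

# 1. Stack each pipeline's out-of-fold predictions into a feature matrix.
Z = np.column_stack([cross_val_predict(p, X, y, cv=3) for p in population])

# 2. Fit a regularized linear model on the stacked predictions.
meta = Lasso(alpha=1.0).fit(Z, y)

# 3. Use |coefficient| as each pipeline's fitness; zeros are pruned.
fitness = np.abs(meta.coef_)
survivors = [p for p, f in zip(population, fitness) if f > 0]
```

Each generation then only costs one cross-validated evaluation per new pipeline plus a single linear fit, rather than one ensemble evaluation per pipeline.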

Maybe something we can collaborate on, @lacava?

@lacava

lacava commented Aug 17, 2017

@rhiever sounds like a good idea. you could use it with any method that admits some kind of feature score, e.g. lasso, random forests, etc., and perhaps even with stacking, if stacking can be made to score the models it uses in its ensemble.

@jonathanng

Another strategy would be to fit a randomized forest on the stacked outputs and use its importance weights as the fitness.
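A sketch of that variant, assuming the same stacked-predictions matrix as above and reading `feature_importances_` from a random forest as the per-pipeline fitness:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
population = [LogisticRegression(max_iter=500),
              DecisionTreeClassifier(random_state=0),
              KNeighborsClassifier()]

# Each column holds one pipeline's out-of-fold predictions.
Z = np.column_stack([cross_val_predict(p, X, y, cv=3) for p in population])
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(Z, y)
fitness = forest.feature_importances_  # one importance weight per pipeline
```

Unlike Lasso coefficients, these importances are never exactly zero in general, so pruning would need a threshold rather than a zero test.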


6 participants