Add Consensus operator #77

Closed

rhiever opened this issue Feb 3, 2016 · 13 comments
Comments

@rhiever
Contributor

rhiever commented Feb 3, 2016

Currently, the only way to combine the classifications of two classifier operators is through the combine_dfs() method, which keeps the classifications of only one of the classifiers and throws out the other.

We should add a Consensus operator that accepts an arbitrary number of DataFrames and uses various ensemble decision criteria (max, mean, majority, etc. -- this would be an evolvable parameter) to combine the DataFrames' classifications in some meaningful way.
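
Roughly what I'm picturing, as a sketch rather than a real implementation (the function signature and the 'guess' column are just placeholders for however we end up representing classifications):

```python
import pandas as pd
from collections import Counter

def _consensus(method, *dataframes):
    """Hypothetical sketch: combine the 'guess' column of several
    classified DataFrames into a single set of classifications."""
    guesses = pd.concat([df['guess'] for df in dataframes], axis=1)
    combined = dataframes[0].copy()

    if method == 'majority':
        # most common guess per row
        combined['guess'] = guesses.apply(
            lambda row: Counter(row).most_common(1)[0][0], axis=1)
    elif method == 'max':
        combined['guess'] = guesses.max(axis=1)
    elif method == 'mean':
        # only sensible for numeric class labels
        combined['guess'] = guesses.mean(axis=1).round()

    return combined
```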

@rhiever
Contributor Author

rhiever commented Feb 3, 2016

It's not immediately clear to me how to create a pipeline operator that takes an arbitrary number of DataFrames. At least from a high-level look, this seems like a non-trivial task. It may be necessary to implement multiple versions of the Consensus operator, each taking a different fixed number of DataFrames as input.

@kadarakos
Contributor

I'm not sure if this is helpful, but in sklearn the VotingClassifier takes an arbitrary number of estimators as input and returns a single estimator.

http://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_decision_regions.html#example-ensemble-plot-voting-decision-regions-py
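
For reference, a minimal usage example of VotingClassifier (the estimator choices here are arbitrary, just to show the shape of the API):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris.data, iris.target

# 'hard' voting = majority rule over the predicted class labels
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('gnb', GaussianNB())],
    voting='hard')
voter.fit(X, y)
print(voter.predict(X[:5]))
```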

@bartleyn
Contributor

bartleyn commented Feb 4, 2016

I think the primary constraint is whether or not DEAP's PrimitiveSetTyped will allow for operators that take in iterables as parameters, no? I can dig around to see if there's something obvious we're missing.
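
For what it's worth, registering a typed primitive in DEAP looks roughly like this; I haven't verified how tree generation behaves when an argument type is a plain list, so treat it as a sketch:

```python
import pandas as pd
from deap import gp

def consensus(dataframes):
    """Placeholder: would combine the classifications of several DataFrames."""
    return dataframes[0]

# A GP tree that takes the input data as one DataFrame and returns a DataFrame.
pset = gp.PrimitiveSetTyped('MAIN', [pd.DataFrame], pd.DataFrame)

# Registration itself accepts `list` as an argument type; the open question is
# whether evolution can ever construct a useful list of DataFrames to feed it.
pset.addPrimitive(consensus, [list], pd.DataFrame)
```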

@rhiever
Contributor Author

rhiever commented Feb 4, 2016

I'm pretty sure it would allow lists as an input, but how could we then make an easy-to-evolve list of pipelines to pass to it?

@bartleyn
Contributor

bartleyn commented Feb 5, 2016

Would it be naive to implement two additional 'helper' operators alongside the consensus one, one of type [list, DataFrame] -> [list] and the other of type [DataFrame, DataFrame] -> [list]? It would balloon the number of operators, but might give us that flexibility.
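
Something like this, sketched out (names hypothetical):

```python
import pandas as pd

def _start_list(df1, df2):
    """[DataFrame, DataFrame] -> [list]: start a list from two DataFrames."""
    return [df1, df2]

def _extend_list(dfs, df):
    """[list, DataFrame] -> [list]: append one more DataFrame to the list."""
    return dfs + [df]
```

Together they work like a cons-list, so a single Consensus operator of type [list] -> [DataFrame] could receive any number of DataFrames without needing a variadic primitive.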

@rhiever
Contributor Author

rhiever commented Feb 7, 2016

That seems to be one way to do it, but I think it would balloon the number of operators and make it more difficult for evolution to work with. I suspect the "best" way to do this is to add a bunch of Consensus operators with increasing numbers of DataFrames as input.

Alternatively, we can think of the GP population of pipelines as the ensemble and add an evolvable parameter to allow evolution to pick the best way to combine their classifications.
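
For the fixed-arity route, the variants could all share one implementation -- reusing the _consensus sketch from above, with hypothetical names:

```python
import pandas as pd

def consensus_two(method, df1, df2):
    return _consensus(method, df1, df2)

def consensus_three(method, df1, df2, df3):
    return _consensus(method, df1, df2, df3)

# Each wrapper would be registered as its own typed primitive, e.g. in DEAP:
# pset.addPrimitive(consensus_two,
#                   [str, pd.DataFrame, pd.DataFrame], pd.DataFrame)
```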

@bartleyn
Contributor

bartleyn commented Feb 7, 2016

I suppose that we could reliably constrain the total number of pipelines being combined, as it's not like we'll be combining hundreds of pipelines.

As for the ensemble approach, would it be as simple as adding a parameter, or would we have to roll our own version of eaSimple?

@rhiever
Contributor Author

rhiever commented Feb 7, 2016

Right. It's not 100% clear how large the ensembles should be.

The population ensemble approach would require a custom version of eaSimple because the population is evaluated together. Probably worth looking into learning classifier systems (specifically, Michigan-style LCS) for inspiration on that end.

In the near future, I think the former approach is more promising. Just need to make sure that evolution can actually make use of those Consensus operators.

@bartleyn
Contributor

bartleyn commented Feb 7, 2016

Agreed. I was actually thinking about a similar ensemble approach to see if it helps address overfitting even more than we have in #64, so I wonder if we should just implement a set number of Consensus operators for now and flesh out this ensemble approach elsewhere.

As for how to meaningfully combine the classifications, beyond the criteria you mentioned, I bet we can take inspiration from meta-learning algorithms like AdaBoost. I'll look into it.

@rhiever
Contributor Author

rhiever commented Feb 7, 2016

👍 Looking forward to seeing what we can do with this idea.

@rhiever
Contributor Author

rhiever commented Feb 17, 2016

Hey @bartleyn, I wanted to check in to see how this issue is coming along. Want to video chat about it?

@bartleyn
Contributor

bartleyn commented Feb 18, 2016

Sure, if you'd like. I've got most of the logic in for the Consensus pipeline operator(s), but I'm getting some major memory blowups (which I suppose was to be expected to some degree). I've taken the approach of allowing each DataFrame to be weighted by some metric (accuracy of the guesses, uniform weights, etc.) before combining them with some evolvable method (max, mean, etc.).
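
To make the weighting idea concrete, it's roughly this shape (a sketch, not the actual code; the 'guess' and 'class' columns are assumptions about the DataFrame layout):

```python
import pandas as pd

def _weighted_consensus(method, weighting, *dataframes):
    """Sketch: weight each DataFrame's guesses, then combine them."""
    if weighting == 'uniform':
        weights = [1.0] * len(dataframes)
    elif weighting == 'accuracy':
        # assumes each DataFrame also carries the true class for scoring its guesses
        weights = [(df['guess'] == df['class']).mean() for df in dataframes]

    guesses = pd.concat([df['guess'] for df in dataframes], axis=1)

    if method == 'mean':
        # weighted average of the guesses (numeric class labels assumed)
        combined = (guesses * weights).sum(axis=1) / sum(weights)
    elif method == 'max':
        # take the guesses from the single highest-weighted DataFrame
        combined = guesses.iloc[:, weights.index(max(weights))]

    result = dataframes[0].copy()
    result['guess'] = combined
    return result
```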

@rhiever
Contributor Author

rhiever commented Feb 19, 2016

Interesting. Is that even with Pareto optimization (as implemented in the latest version of TPOT)? We should video chat to discuss what's going on.
