Add Consensus operator #77
It's not immediately clear to me how to create a pipeline operator that takes an arbitrary number of DataFrames. I think this would be a non-trivial task, at least at a high level. It may be necessary to implement multiple versions of the Consensus operator that take a varying number of DataFrames as input.
I'm not sure if this is helpful, but in sklearn the VotingClassifier takes an arbitrary number of estimators as input and returns a single estimator.
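For reference, a minimal sketch of that sklearn pattern; the dataset and the particular estimators are just illustrative choices:

```python
# VotingClassifier accepts an arbitrary-length list of estimators --
# the property we'd want the Consensus operator to mirror.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The estimators list can hold any number of (name, estimator) pairs.
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('dt', DecisionTreeClassifier()),
        ('nb', GaussianNB()),
    ],
    voting='hard',  # 'hard' = majority vote; 'soft' averages probabilities
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```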
I think the primary constraint is whether or not DEAP's strongly typed GP allows lists as an input type.
I'm pretty sure it would allow lists as an input, but how could we then make an easy-to-evolve list of pipelines to pass to it?
Would it be naive to implement two additional 'helper' operators alongside the consensus one: one of type [list, DataFrame] -> [list] and the other of type [DataFrame, DataFrame] -> [list]? It would balloon the number of operators, but might give us that flexibility.
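A sketch of what those two helpers might look like under DEAP's strongly typed GP; the function names, the 'MAIN' primitive set, and the exact type signatures are assumptions, not TPOT's actual code:

```python
# Hypothetical helper primitives for building up a list of DataFrames
# in DEAP's strongly typed GP; names are illustrative.
import pandas as pd
from deap import gp

def pair_dfs(df1, df2):
    # [DataFrame, DataFrame] -> [list]: start a list from two DataFrames
    return [df1, df2]

def append_df(df_list, df):
    # [list, DataFrame] -> [list]: grow an existing list by one DataFrame
    return df_list + [df]

# A pipeline takes one DataFrame in and produces one DataFrame out.
pset = gp.PrimitiveSetTyped('MAIN', [pd.DataFrame], pd.DataFrame)
pset.addPrimitive(pair_dfs, [pd.DataFrame, pd.DataFrame], list)
pset.addPrimitive(append_df, [list, pd.DataFrame], list)
# ...and Consensus itself would then consume the list:
# pset.addPrimitive(consensus, [list], pd.DataFrame)
```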
That seems to be one way to do it, but I think it would balloon the number of operators and make it more difficult for evolution to work with. I suspect the "best" way to do this is to add a bunch of Consensus operators with increasing numbers of DataFrames as input. Alternatively, we can think of the GP population of pipelines as the ensemble and add an evolvable parameter to allow evolution to pick the best way to combine their classifications.
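A sketch of the fixed-arity route under the same assumptions; the majority-vote combiner and the 'guess' column convention are placeholders for whatever the real operator does:

```python
# Several Consensus primitives with increasing arity, all delegating
# to one combiner. Assumes each DataFrame stores its pipeline's
# classifications in a 'guess' column.
import pandas as pd
from deap import gp

def _majority_vote(dfs):
    # Take the most common guess per row across all input DataFrames.
    guesses = pd.concat([df['guess'] for df in dfs], axis=1)
    combined = dfs[0].copy()
    combined['guess'] = guesses.mode(axis=1).iloc[:, 0]
    return combined

def consensus_two(df1, df2):
    return _majority_vote([df1, df2])

def consensus_three(df1, df2, df3):
    return _majority_vote([df1, df2, df3])

pset = gp.PrimitiveSetTyped('MAIN', [pd.DataFrame], pd.DataFrame)
pset.addPrimitive(consensus_two, [pd.DataFrame] * 2, pd.DataFrame)
pset.addPrimitive(consensus_three, [pd.DataFrame] * 3, pd.DataFrame)
# ...and so on, up to whatever maximum arity we settle on.
```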
I suppose that we could reliably constrain the total number of pipelines being combined, as it's not like we'll be combining hundreds of pipelines. As for the ensemble approach, would it be as simple as adding a parameter, or would we have to roll our own version of eaSimple?
Right. It's not 100% clear how large the ensembles should be. The population ensemble approach would require a custom version of eaSimple. In the near future, I think the former approach is more promising.
Agreed. I was actually thinking about a similar ensemble approach to see if it helps address overfitting even more than we have in #64, so I wonder if we should just implement a set number of Consensus operators for now and flesh out this ensemble approach elsewhere. As for how to meaningfully combine the classifications, in addition to what you mentioned, I bet we can take inspiration from meta-learning algorithms like AdaBoost. I'll look into it.
👍 Looking forward to seeing what we can do with this idea.
Hey @bartleyn, I wanted to check in to see how this issue is coming along. Want to video chat about it?
Sure, if you'd like. I've got most of the logic in for the Consensus pipeline operator(s), but I'm getting some major memory blowups (which I suppose was to be expected to some degree). I've taken the approach of allowing each DataFrame to be weighted by some metric (accuracy of the guesses, uniform weights, etc.) before combining them with some evolvable method (max, mean, etc.).
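A minimal sketch of that weight-then-combine scheme, assuming TPOT-style DataFrames with 'class' and 'guess' columns; the function and option names are placeholders, not the actual implementation:

```python
# Weight each DataFrame's guesses by some metric, then merge them
# with an evolvable method. Column names ('class', 'guess') and the
# option names are assumptions.
import numpy as np
import pandas as pd

def weighted_consensus(dfs, weighting='accuracy', method='mean'):
    if weighting == 'accuracy':
        # Weight each pipeline by how well its guesses match the labels.
        weights = np.array([(df['guess'] == df['class']).mean() for df in dfs])
    else:
        # Uniform weighting.
        weights = np.ones(len(dfs))
    weights = weights / weights.sum()

    guesses = np.column_stack([df['guess'].values for df in dfs])
    if method == 'max':
        # Take every guess from the single highest-weighted pipeline.
        combined = guesses[:, np.argmax(weights)]
    else:
        # Weighted mean of the guessed labels (only sensible when the
        # class labels are ordinal or binary).
        combined = np.rint(guesses.dot(weights))

    result = dfs[0].copy()
    result['guess'] = combined
    return result
```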
Interesting. Is that even with Pareto optimization?
Currently, the only way to combine the classifications of two classifier operators is through the combine_dfs() method, which only takes the classifications of one of the classifiers and throws out the other. We should add a Consensus operator that allows an arbitrary number of DataFrames to be passed to it, and that uses various ensemble decision criteria (max, mean, majority, etc. -- this would be an evolvable parameter) to combine the DataFrames' classifications in some meaningful way.
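A rough sketch of the requested operator's shape, with the decision criterion expressed as an evolvable integer parameter; the 'guess' column convention and the function name are assumptions, not TPOT's actual API:

```python
# A Consensus operator whose decision criterion is an evolvable
# integer parameter. Assumes each input DataFrame carries its
# pipeline's classifications in a 'guess' column.
import pandas as pd

CRITERIA = ('majority', 'max', 'mean')  # evolvable choices

def consensus(criterion_idx, *dfs):
    # Map the evolved integer onto one of the decision criteria.
    criterion = CRITERIA[criterion_idx % len(CRITERIA)]
    guesses = pd.concat([df['guess'] for df in dfs], axis=1)
    if criterion == 'majority':
        combined = guesses.mode(axis=1).iloc[:, 0]
    elif criterion == 'max':
        combined = guesses.max(axis=1)
    else:  # 'mean' -- only sensible for ordinal class labels
        combined = guesses.mean(axis=1).round()
    result = dfs[0].copy()
    result['guess'] = combined
    return result
```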