
Add Consensus Operators #96

Closed · wants to merge 17 commits

Conversation

bartleyn
Contributor

What does this PR do?

Addresses #77 by adding three consensus pipeline operators: consensus_two, consensus_three, and consensus_four, along with the corresponding export_utils code and a test.

Where should the reviewer start?

consensus_two and the weighting/combination functions defined above it.
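For orientation, the two-input case is essentially a weighted vote over the guess columns of the input DataFrames. A simplified sketch of the idea (the column name, weights, and function name here are illustrative, not the exact code in the PR):

```python
import pandas as pd

def consensus_two_sketch(df1, df2, weights=(0.5, 0.5)):
    """Illustrative two-way consensus: weighted vote over the 'guess' columns."""
    def weighted_vote(row):
        tally = {}
        for guess, w in zip(row.values, weights):
            tally[guess] = tally.get(guess, 0.0) + w
        return max(tally, key=tally.get)

    combined = df1.copy()
    votes = pd.concat([df1['guess'], df2['guess']], axis=1)
    combined['guess'] = votes.apply(weighted_vote, axis=1)
    return combined
```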

How should this PR be tested?

By checking whether the consensus operators contribute more to the overall fitness of the generated populations than _combine_dfs alone does. The export code could use more thorough testing as well.

Any background context you want to provide?

I originally had an additional weighting scheme I was trying to put into place, but implementing it was challenging, so I opted to remove it.

What are the relevant issues?

#77

Screenshots (if appropriate)

Questions:

  • Do the docs need to be updated?
    I don't think so.
  • Does this PR add new (Python) dependencies?
    No, everything's implemented from scratch.

@rhiever
Contributor

rhiever commented Feb 27, 2016

Happy to see this PR come in! Were the consensus operators used in any of your tests? I'm currently running a big TPOT benchmark on the cluster, but I'll line this PR up for the next benchmark.

@bartleyn
Contributor Author

Yeah, I ran numerous small tests that ended up with consensus operators in the pipeline. They performed well, but it's tough to compare since some of the other runs ended up with (presumably) overfit simple pipelines with perfect accuracy.

@rhiever
Contributor

rhiever commented Feb 27, 2016

Sounds promising! I look forward to benchmarking the code then.

It may take a while to get to the benchmark, though. Just a heads up.

@rhiever
Contributor

rhiever commented Feb 27, 2016

Looks like your tests are having some issues with Python 3. I think it's because you're using Python 2 print statements. 👎 ;-)
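For reference, the fix is just switching to the print function form (the variable below is a placeholder):

```python
from __future__ import print_function  # harmless on Python 3, fixes 2.7

best_score = 0.97  # placeholder value
# print "best score:", best_score   <- Python 2 only; SyntaxError on Python 3
print("best score:", best_score)     # works on both 2.7 and 3.x
```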

@bartleyn
Contributor Author

Whoops, that's what I get for being stuck in the 2.7 past.

@rhiever
Contributor

rhiever commented Feb 27, 2016

Tch tch tch... join us in ze future! 👍

@rhiever
Contributor

rhiever commented Feb 27, 2016

D'oh! Now it's failing the unit tests.

@bartleyn
Contributor Author

I'm out and about right now, but I wouldn't be surprised if I was running an older version of the tests. I'll test again when I get back.

@bartleyn
Contributor Author

With the above commits I've made the necessary changes to run all the tests in tests.py, and tested functionality with some small examples all within a Python 3.5 environment. Are there any tests I'm missing?

@rhiever
Contributor

rhiever commented Feb 28, 2016

Integration tests (i.e., running TPOT on a fixed data set with a fixed RNG seed for a fixed number of generations and checking the output) are always useful, but I'm not sure if we can do that here.
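Something along these lines, just to illustrate the shape of such a test (assuming the TPOTClassifier-style API; the parameters, data set, and score threshold below are placeholders, not values from this PR):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

def test_fixed_seed_run():
    # Fixed data set and fixed RNG seeds so the run is reproducible
    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, random_state=42)
    tpot = TPOTClassifier(generations=2, population_size=10,
                          random_state=42, verbosity=0)
    tpot.fit(X_train, y_train)
    # With everything seeded, the held-out score should be stable across runs;
    # the threshold is an arbitrary placeholder.
    assert tpot.score(X_test, y_test) > 0.8
```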


@bartleyn
Contributor Author

I'll do a set of runs on the MNIST data and report some stats on the performance and appearance of consensus operators compared to others.

@rhiever
Contributor

rhiever commented Mar 3, 2016

Just a small update: The base TPOT benchmark should finish up by the weekend, after which point I'll be able to throw on "TPOT-Consensus" and give it a serious spin. Will keep you posted.

@bartleyn
Contributor Author

bartleyn commented Mar 7, 2016

Okay, then I'll hold off on getting deeper into this until the benchmarks finish and work on something else in the meantime. I think there's a lot more information encoded in the input features of the various DataFrames than in just the guesses, but perhaps it's not worth the effort right now. Are there any other schemes we want to test, though? Threshold?

@rhiever
Contributor

rhiever commented Mar 7, 2016

Benchmarks are queued now. I have 10 copies of TPOT-Consensus running against 90 different data sets. Analyzing the resulting best pipelines should give us a good sense of whether the consensus operators are contributing usefully or not.

Are there any other schemes we want to test, though? Threshold?

Yes, that could be a good one.

@rhiever
Contributor

rhiever commented Mar 7, 2016

BTW, if you want to take a stab at #105 in a separate branch in the meantime, that would be awesome. I think that's a huge issue to address on the research end right now.

@bartleyn
Contributor Author

bartleyn commented Mar 8, 2016

I realized that I might have a different idea of thresholding than what you're talking about: I'm thinking of assigning a DataFrame a weight of 0 (eliminating its impact on the guesses) if it does not pass a (perhaps parameterized) accuracy threshold.
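In code, roughly (the column names and cutoff are placeholders, not settled choices):

```python
import numpy as np

def threshold_weights(dfs, cutoff=0.6):
    """Give each DataFrame a weight of 0 if its guesses fall below the accuracy
    cutoff, otherwise weight it by its accuracy so stronger inputs count more."""
    weights = []
    for df in dfs:
        accuracy = np.mean(df['guess'] == df['class'])
        weights.append(accuracy if accuracy >= cutoff else 0.0)
    return weights
```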

@rhiever
Contributor

rhiever commented Mar 9, 2016

Ah, yes. I usually think of threshold as "if X% of guesses are for one class, then it's that class." Maybe that will be too difficult in the multiclass case, though.
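Purely as illustration, that reading could look like this (what to do when no class reaches the fraction is exactly the open question for the multiclass case):

```python
from collections import Counter

def threshold_vote(guesses, fraction=0.75):
    """Return the consensus class if at least `fraction` of the guesses agree,
    otherwise None -- which is where the multiclass case gets awkward."""
    counts = Counter(guesses)
    top_class, top_count = counts.most_common(1)[0]
    return top_class if top_count / float(len(guesses)) >= fraction else None
```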


@bartleyn
Contributor Author

bartleyn commented Mar 9, 2016

Oh wait, I merged the upstream changes without thinking about the possible consequences for the benchmark tests; should I go ahead and revert the merge?

@rhiever
Contributor

rhiever commented Mar 9, 2016

Well, I already have a copy of TPOT-Consensus on the HPCC, so it should be okay.

@rhiever
Contributor

rhiever commented Mar 11, 2016

Another small update: HPCC is taking bloody forever to run these jobs. They're stuck in a queue behind some bigger jobs I had queued. Bad queue management system... sigh.

@rhiever
Contributor

rhiever commented Mar 15, 2016

The jobs are finishing up today, so I should be able to analyze the results tomorrow morning and see how this turned out.

Also looks like this branch has conflicts with the latest version of TPOT. Argh. Let's not bother cleaning up that merge until we see if this feature will allow for better pipelines.

@bartleyn
Contributor Author

Agreed. It's not worth it to fix the merge if the results aren't looking good. But if they are (fingers crossed), at least this PR is only ~a week behind.

@rhiever
Contributor

rhiever commented Mar 16, 2016

Welp... I'm sad to report that TPOT doesn't really seem to be evolving pipelines with the consensus operators. Only 1.5% of the pipelines from the benchmark even contained a consensus operator, and none of those really seemed to use it in a meaningful way.

It's possible that Pareto optimization is disfavoring the larger pipelines that the consensus operators entail. If you want to roll back the GP selection process to simply maximize classification accuracy again, I can grab the latest from this fork and re-run the benchmark.
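Roughly, that change would live in the selection step, something like this DEAP-style sketch (the function name and the objective ordering are assumptions, not TPOT's actual internals):

```python
from deap import tools

def select_next_generation(population, k, use_pareto=True):
    if use_pareto:
        # Multi-objective Pareto selection, e.g. over (pipeline size, accuracy)
        return tools.selNSGA2(population, k)
    # Single-objective alternative: rank purely by classification accuracy,
    # assuming accuracy is the last value in each individual's fitness tuple.
    return sorted(population,
                  key=lambda ind: ind.fitness.values[-1],
                  reverse=True)[:k]
```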

I should also note that a large portion (over half) of the runs didn't finish in time -- I only gave each run 8 hours to complete 100 generations -- so it's possible that consensus operators were being used there. That's still a bad sign, though, as it likely means that TPOT with the consensus operators is even slower than it already is. Not good!

Perhaps a more promising path is to try to combine the population of pipelines into ensembles, as in #105. Really looking forward to hearing how that pans out.

@bartleyn
Contributor Author

That stinks, but negative results are useful results too, I suppose. I'll take a look at testing without Pareto optimization when I get the chance, but I agree that #105 is probably more promising.
