First [uber hacky] attempt to address tpot issue #95: parallelizing t… #100

Closed
wants to merge 1 commit into from

Conversation


@magsol magsol commented Mar 1, 2016

This is a first attempt to parallelize the GA pipeline in tpot. Since tpot uses DEAP under the hood for the genetic algorithm, the parallelization strategy involves overriding its map implementation.

The DEAP documentation specifies two methods for parallelization: SCOOP and Python's built-in multiprocessing library. The former is not feasible, given that the entire package would have to be run through it. The latter is possible, but it triggers a serialization error: DEAP attempts to pickle the Toolbox object and, with it, the encapsulated Pool object (this error).

The workaround implemented in this PR is to extend the DEAP Toolbox with the required __getstate__ and __setstate__ methods, which drop the Pool object just prior to serialization.
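
A minimal sketch of that workaround, using only the standard library. `PoolSafeToolbox` is a hypothetical stand-in and not the class from this PR; the real version would subclass DEAP's Toolbox, where the pool is usually attached with `toolbox.register("map", pool.map)`. A `ThreadPool` is used here only so the sketch runs without a `__main__` guard; it inherits from the same `Pool` class and is equally unpicklable.

```python
from multiprocessing.pool import ThreadPool


class PoolSafeToolbox:
    """Hypothetical stand-in for a Toolbox subclass (not the PR's code)."""

    def __init__(self, pool=None):
        self.pool = pool
        # Expose the mapping function as `map`, falling back to the builtin.
        self.map = pool.map if pool is not None else map

    def __getstate__(self):
        # Copy the instance dict so the live object keeps its pool.
        state = self.__dict__.copy()
        # Drop the pool and the bound method that references it;
        # neither can be pickled.
        state.pop("pool", None)
        state.pop("map", None)
        return state

    def __setstate__(self, state):
        # Restore everything else; the unpickled copy has no pool
        # and falls back to the serial builtin map.
        self.__dict__.update(state)
        self.pool = None
        self.map = map
```

With these two methods in place, `pickle.dumps` on an instance serializes every attribute except the pool, and the copy rebuilt on the other side evaluates serially. The original object in the parent process is left untouched and keeps dispatching through its pool.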

Where should the reviewer start?

The parallelization appears to work (see screenshot), but it's unclear if a whole separate subclass is required. joblib would be preferable here, but there doesn't seem to be a way to decouple the partial it generates from its dispatcher and hand that over to DEAP.

How should this PR be tested?

Performance testing, primarily. From the more knowledgeable committers, a deeper dive into whether or not subclassing is really the answer here.

Any background context you want to provide?

Mostly provided above. This is not ready for inclusion; several other important points remain, mainly a command-line parameter to control parallelization. I just wanted to get the review process rolling.

What are the relevant issues?

#95

Screenshots (if appropriate)

[Screenshot: screen shot 2016-03-01 at 4 27 39 pm]

Questions:

  • Do the docs need to be updated? Yes, we'll need to add a command-line parameter to control parallelization.
  • Does this PR add new (Python) dependencies? No, just a new module.

@rhiever
Contributor

rhiever commented Mar 3, 2016

Cool beans! That's all it took to parallelize DEAP? Very curious to see how effectively it shares memory. I'm rushing against paper deadlines again, but I'm very eager to tinker with this PR to see if we can merge it into TPOT-master.

My main concerns are:

  • If we pass a large data set to TPOT, does memory usage explode because it's evaluating many pipelines simultaneously? How effectively does it share memory? (Perhaps hard-code some complicated pipelines and run them in parallel.)
  • How well do TPOT pipelines parallelize? Is there a speedup over a non-parallelized version?
  • Does parallelizing the pipeline evaluations affect reproducibility? Will there be race conditions that affect reproducibility? (We've tried hard to code the pipeline operators such that they're deterministic, but it's possible parallelization may discover something we missed.)
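
One way to probe the reproducibility question, sketched under assumptions: give every evaluation its own seeded random generator so the result cannot depend on which worker runs it or in what order, then check that a serial run and a parallel run agree. `evaluate`, `evaluate_population`, and the seeding scheme are hypothetical illustrations, not TPOT or DEAP code, and a `ThreadPool` stands in for a process pool to keep the example portable.

```python
import random
from multiprocessing.pool import ThreadPool


def evaluate(task):
    """Score one individual using a private, seeded RNG."""
    seed, individual = task
    rng = random.Random(seed)  # no shared global random state
    return sum(individual) + rng.random()


def evaluate_population(population, base_seed, map_fn=map):
    """Evaluate every individual; map_fn may be the builtin map or pool.map."""
    # The seed depends only on the individual's position, not on the worker.
    tasks = [(base_seed + i, ind) for i, ind in enumerate(population)]
    return list(map_fn(evaluate, tasks))


population = [[1, 2], [3, 4], [5, 6]]
serial = evaluate_population(population, base_seed=42)
with ThreadPool(3) as pool:
    parallel = evaluate_population(population, base_seed=42, map_fn=pool.map)

# Identical results mean the evaluation order did not leak into the scores.
assert serial == parallel
```

A check like this only covers randomness inside the evaluation function; it says nothing about memory usage or speedup, which would need separate measurements on real pipelines.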

@rhiever
Contributor

rhiever commented Aug 13, 2016

Going to close this PR since we won't be able to merge it into the existing dev branch, but thank you for putting the demo together. We will refer to this PR when considering parallelization options for TPOT.

@rhiever rhiever closed this Aug 13, 2016
@magsol
Author

magsol commented Aug 13, 2016

Sorry for dropping off; job took over and hasn't let up. However, I am working on designing additional experiments for a paper that will involve testing multiple ML pipelines, so if teaching this semester can become somewhat self-sustaining I may be able to revisit this issue sooner rather than later.


@rhiever
Contributor

rhiever commented Aug 13, 2016

Totally understand! If you end up putting together another PR, please PR it to the development branch. master is reserved for the latest stable release now.

@magsol
Author

magsol commented Aug 13, 2016

Will do, thanks!

@mficek
Contributor

mficek commented Feb 23, 2017

Hi guys, is there any progress on this PR? I'm playing with tpot now, and parallelizing it at the DEAP level would be great. May I ask what prevented you from merging magsol's PR into the dev branch? If I pick this up, some background knowledge could help a lot. Thanks!

@magsol
Author

magsol commented Feb 23, 2017

@mficek The reason was pretty straightforward: I ran out of free time to run all the experiments that would empirically show one way or another whether this particular strategy of parallelization yielded any benefit :) With spring break coming up it's possible I may be able to come back to this, especially if I know someone is actively interested.

@rhiever
Contributor

rhiever commented Feb 23, 2017

Check the development branch of TPOT. @weixuanfu2016 implemented multiprocessing for TPOT there, although I'll note that it's still not fully validated.

@mficek
Contributor

mficek commented Feb 24, 2017

Yes, I tried @weixuanfu2016's version yesterday, it looks great! And now I see it's even in the development branch, sweet.
