
Does TPOT support memory when running dask.distributed? #1228

Open
KrzysztofNawara opened this issue Sep 4, 2021 · 0 comments
I wanted to use TPOT with

  1. dask.distributed running multiple processes on the local machine
  2. memory enabled, to cache common transformations across processes (it's supposed to be multiprocessing-safe)
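For context, a minimal sketch of the caching mechanism in question, using joblib.Memory directly (TPOT's memory parameter is ultimately passed through to sklearn's Pipeline, which uses joblib.Memory under the hood). The cache directory and the transform function are illustrative, not TPOT code:

```python
# Sketch: joblib.Memory caches function results on disk, so a repeated
# call with the same arguments does not re-run the function body.
import tempfile
from joblib import Memory

calls = []  # record real invocations of the function body

with tempfile.TemporaryDirectory() as cachedir:
    memory = Memory(location=cachedir, verbose=0)

    @memory.cache
    def transform(x):
        calls.append(x)
        return x * 2

    first = transform(21)   # cache miss: the body runs
    second = transform(21)  # cache hit: served from disk

assert first == second
assert calls == [21]  # the body ran only once
```

The multiprocessing safety mentioned above comes from the cache living on a shared filesystem rather than in process memory, which is why it would be attractive to combine with dask.distributed workers on one machine.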

But two things make me think this mode of operation is not supported:

  1. Setting a breakpoint inside the joblib.Memory.cache() function - it only gets called to check whether a produced individual is valid (the check_pipeline/_pre_test functions)
  2. Looking at the code that actually evaluates individuals. Everything seems to happen inside dask_ml.model_selection._search.build_graph(). But the way it handles pipelines (if my analysis is correct) is to recursively extract all leaf transformers and estimators, turn them into Dask graph nodes and then, at the end, rebuild the pipelines. No sklearn.Pipeline code appears to be executed (and that's where caching is implemented).
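To make point 2 concrete: in plain sklearn, caching lives in Pipeline's memory argument, which wraps each transformer's fit_transform in joblib.Memory during fit. If dask_ml's build_graph() decomposes the pipeline into individual graph nodes and never calls Pipeline.fit itself, this code path is simply bypassed. A small sketch of the sklearn-side mechanism (the estimators chosen here are arbitrary examples):

```python
# Sketch: sklearn.Pipeline with the `memory` argument caches intermediate
# fit_transform results on disk during fit(). This is the caching that
# appears to be skipped when dask_ml rebuilds the graph node-by-node.
import tempfile
import numpy as np
from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.RandomState(0).rand(20, 3)
y = np.arange(20) % 2  # toy binary target

with tempfile.TemporaryDirectory() as cachedir:
    pipe = Pipeline(
        [("scale", StandardScaler()), ("clf", LogisticRegression())],
        memory=Memory(location=cachedir, verbose=0),
    )
    # During fit, StandardScaler's fit_transform result is cached on disk;
    # a second fit with identical data/params would reuse it.
    pipe.fit(X, y)

preds = pipe.predict(X)
```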

My questions are as follows:

  1. Is my analysis correct and that mode is indeed unsupported?
  2. What would be the easiest way to add this caching functionality?