TPOT freezing at 0% with n_jobs >4 on linux with large dataset #876

Open
s-marton opened this issue Jun 3, 2019 · 8 comments

s-marton commented Jun 3, 2019

Hello,

I have been trying to get TPOT running for a while now but always encounter the same error. I have a Linux machine with 24 cores. When I run TPOT on a large dataset (~6 million rows, ~20 features), it freezes at 0%, and after about 10-20 minutes the CPU usage drops to a few percent. I already tried setting the multiprocessing start method to forkserver, without any change. I also tried the dask implementation, but since max_eval_time_mins does not seem to work there, it runs forever.

However, the problem does not occur for every n_jobs != 1, only when n_jobs > 4. I do not really know what else to try and would appreciate any suggestions.

Thanks!

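from tpot import TPOTRegressor

# RANDOM_SEED, X_train and y_train are defined elsewhere (not shown in this post).
aml_tpot = TPOTRegressor(
    scoring='neg_mean_squared_error',
    generations=20,
    population_size=50,
    verbosity=3,
    random_state=RANDOM_SEED,
    n_jobs=16,               # the freeze only shows up with n_jobs > 4
    max_eval_time_mins=20,
    cv=3,
)

aml_tpot.fit(X_train.values, y_train.values.ravel())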
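
For reference, the forkserver setting mentioned in this post can be sketched with the standard multiprocessing module alone. This is a minimal sketch, not the poster's actual code: the helper name use_forkserver is hypothetical, and the assumption is that TPOT would be imported and fitted only after the start method has been chosen.

```python
import multiprocessing


def use_forkserver():
    """Select the 'forkserver' start method and return the method now in effect."""
    # force=True lets the call succeed even if a start method was chosen earlier.
    multiprocessing.set_start_method("forkserver", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    print(use_forkserver())
    # Assumption: TPOT is imported and fitted only after this point, e.g.
    # from tpot import TPOTRegressor
```

Note that 'forkserver' is only available on Unix platforms, and, as reported above, it did not resolve the freeze in this case.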

weixuanfu added the bug label Jun 3, 2019

@weixuanfu
Contributor

It seems that there is a kind of threading deadlock issue (maybe related to this old issue in joblib). Could you please try to update joblib (> 0.13.2) and scikit-learn (>=0.21) via conda or pip and reinstall TPOT development branch via the command below?

pip install --upgrade --no-deps --force-reinstall git+https://github.com/EpistasisLab/tpot.git@development

We recently noticed that the internal joblib module (based on an older version of joblib) in scikit-learn (<0.20) was deprecated (see #867) and may cause the issue here because it did not have some important updates about limiting the number of threads in joblib (>0.12, see joblib change log). LMK if this solution works or not.

@s-marton
Author

s-marton commented Jun 3, 2019

Thanks for the quick reply!
joblib 0.13.2 is the current version, so I can't update to joblib > 0.13.2. Or did I get something wrong here?

I installed the development branch and tried it again with joblib == 0.13.2 and scikit-learn >=0.21, but unfortunately it is still freezing.

@s-marton
Author

s-marton commented Jun 3, 2019

However, I just tried it again and now it is stuck at 5% (54/1050).

@s-marton
Author

s-marton commented Jun 6, 2019

Changing the value of DEFAULT_THREAD_BACKEND = 'threading' to e.g. 'loky' in parallel.py of joblib worked for me.
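
A less invasive way to experiment with backends, without editing the installed parallel.py, is joblib's parallel_backend context manager. The sketch below only demonstrates the mechanism on a toy workload; the helper names square and run_with_backend are hypothetical, and whether wrapping a TPOT fit this way reaches the same code paths as the parallel.py edit (or avoids the freeze) is an untested assumption.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    return x * x


def run_with_backend(backend, n_jobs=2):
    """Run a tiny joblib workload under the given backend name."""
    # Parallel() calls inside the block that do not name a backend
    # themselves pick up the backend and n_jobs chosen here.
    with parallel_backend(backend, n_jobs=n_jobs):
        return Parallel()(delayed(square)(i) for i in range(5))


if __name__ == "__main__":
    print(run_with_backend("loky"))
    # Assumption (untested): a TPOT fit could be wrapped the same way, e.g.
    # with parallel_backend("loky"):
    #     aml_tpot.fit(X_train.values, y_train.values.ravel())
```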

@huaiyizhao

Not working. It runs only when n_jobs is set to 1

@huaiyizhao

@Chowkah Could you please explain how you reached 5%? I am still stuck at 0%.

@s-marton
Author

I can't really pin it down to one thing. I tried several different things; sometimes it worked, sometimes not. I changed the parallel backend directly in joblib's parallel.py, which sometimes helped. I also changed my random seed to another value, and with otherwise the same settings it worked. So the problem might be related to a specific algorithm (maybe only with some specific parameter setting) that makes TPOT freeze. However, I was not able to identify which one it might be.
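
One generic way to see where a hung run is stuck, which was not tried in this thread, is Python's standard faulthandler module: it can periodically dump the tracebacks of all threads. The sketch below is a suggestion under assumptions; the helper names arm_hang_watchdog and disarm_hang_watchdog are hypothetical, and the dump covers only the process in which it is armed, not joblib worker processes.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600, file=sys.stderr):
    """Dump the tracebacks of all threads every `timeout_s` seconds until cancelled."""
    # `file` must be backed by a real file descriptor (it needs fileno()).
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disarm_hang_watchdog():
    """Stop the periodic dumps."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    arm_hang_watchdog(timeout_s=600)
    # Assumption: the long-running call would go here, e.g.
    # aml_tpot.fit(X_train.values, y_train.values.ravel())
    disarm_hang_watchdog()
```

If the parent process is merely waiting on workers, the dump would show it blocked inside joblib, which at least helps distinguish a deadlock in the parent from a single pipeline that never finishes.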

@huaiyizhao

Thank you. Maybe I should start with the examples in the official docs, change a few things at a time, and see what happens.
