
TPOT stuck at 75th generation with no errors #1214

Open
aecomesana opened this issue Jun 9, 2021 · 4 comments

Comments

@aecomesana

I am running the GPU-accelerated (Dask-based) configuration of TPOT (version 0.11.7) with the TPOT cuML config on several different datasets, using Python 3 with Anaconda.

For every dataset, TPOT gets stuck at generation 74 or 75 regardless of size (the datasets range from 480 rows × 10 columns up to 9,000 rows × 83 columns). No error is output; the periodic checkpoint folder simply stops updating and no new messages appear. I left it running, but after 8 hours nothing new had come up.
I changed the random seed of the TPOT regressor to rule out a problem with a specific model architecture, but with a different seed it still gets stuck at generation 75.

My TPOT regressor is configured as follows:
tpot = TPOTRegressor(verbosity=2,
                     use_dask=True,
                     n_jobs=-1,
                     cv=5,
                     random_state=42,  # this was changed, as mentioned above
                     template='Regressor',
                     config_dict='TPOT cuML',
                     periodic_checkpoint_folder='../checkpoints/{}/'.format(target),
                     max_time_mins=None)

Any idea how to solve this issue, or why it happens every time across different datasets?
Thank you!

@beckernick
Contributor

beckernick commented Jul 13, 2021

You should use n_jobs=1 (the default). cuML is currently designed for the "one process per GPU" paradigm. Additionally, how are you setting up your Dask cluster?

It might be valuable to test your system and environment with this example gist or confirm your configuration is similar.
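For reference, a minimal sketch of that kind of setup, assuming a dask_cuda-based cluster (LocalCUDACluster and Client are my assumption about the gist's contents, not taken from this thread):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from tpot import TPOTRegressor

# One Dask worker per visible GPU, matching cuML's one-process-per-GPU model.
cluster = LocalCUDACluster()
client = Client(cluster)

tpot = TPOTRegressor(verbosity=2,
                     use_dask=True,
                     n_jobs=1,  # keep the default rather than -1
                     cv=5,
                     random_state=42,
                     template='Regressor',
                     config_dict='TPOT cuML')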

@rhamnett

I just got caught by this; there needs to be a better error message when using cuML with n_jobs left at -1.

@beckernick
Contributor

If the maintainers are open to it, perhaps we could open a PR that validates the n_jobs parameter when the cuML configuration is used.

@rhamnett

> If the maintainers are open to it, perhaps we could open a PR that validates the n_jobs parameter when the cuML configuration is used.

Yes, probably just require n_jobs > 0, as I think you can use multiple GPUs.
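A minimal sketch of what that validation could look like (the helper name and error message are hypothetical, not from TPOT's codebase):

def _check_cuml_n_jobs(config_dict, n_jobs):
    # Hypothetical guard, not part of TPOT: fail fast instead of hanging
    # when the cuML configuration is combined with n_jobs=-1 or 0.
    if config_dict == 'TPOT cuML' and n_jobs < 1:
        raise ValueError(
            "The 'TPOT cuML' configuration expects n_jobs >= 1 "
            "(one process per GPU); got n_jobs={}".format(n_jobs)
        )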
