Improve resiliency of trial reservation #693
Merged
Why:
We had two levels of patience when reserving a trial. There was the
customizable `max_idle_time`, used in producer.produce() to limit the
time spent trying to generate new trials, and there was `_max_depth` in
reserve_trial, limiting the number of times a reservation would be
attempted and producer.produce() would be called. This led to misleading
error messages. For instance, with many workers it could happen that a
worker was never able to reserve a trial because each time it
executed producer.produce(), all other workers would reserve the trials
before the current worker had time to reserve one. In such a scenario the
error message would say that the algorithm was unable to sample new points
and was waiting for trials to complete. That is not true. We should state
the number of trials that were generated during these reservation
attempts and recommend increasing the pool-size and timeout.
How:
Producer.produce now attempts to produce `pool-size` trials only once
(calling algo.suggest only once) and returns the number of successfully
produced trials. All the patience is moved to `reserve_trial`, which
attempts reserving and producing until it reaches the timeout, at
which point a helpful error message is raised.
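
For reviewers, the control flow described above can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: `ReservationTimeout`, the `reserve` and `produce` callables, and the `clock` parameter are hypothetical stand-ins for the real reservation and producer objects.

```python
import time


class ReservationTimeout(Exception):
    """Hypothetical error raised when no trial could be reserved in time."""


def reserve_trial(reserve, produce, pool_size, timeout=60.0, clock=time.monotonic):
    """Alternate between reserving and producing until `timeout` seconds pass.

    Assumptions for this sketch: reserve() returns a trial or None, and
    produce(pool_size) makes a single attempt (one suggest call) and returns
    the number of trials it successfully registered.
    """
    start = clock()
    n_produced = 0
    while True:
        trial = reserve()
        if trial is not None:
            return trial
        if clock() - start >= timeout:
            # Report how many trials were produced so the message does not
            # wrongly blame the algorithm when other workers took them.
            raise ReservationTimeout(
                f"Unable to reserve a trial within {timeout} seconds. "
                f"{n_produced} trials were generated during the attempts but "
                "were reserved by other workers. Consider increasing the "
                "pool size or the timeout."
            )
        # Single production attempt per iteration; all patience lives here.
        n_produced += produce(pool_size)
```

With a single timeout in one place, the error message can distinguish "the algorithm produced nothing" from "trials were produced but other workers reserved them first".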