Improve resiliency of trial reservation #693

bouthilx · 2021-11-23T16:09:01Z

Why:

We had two levels of patience when reserving a trial. There was the
customizable max_idle_time used in producer.produce() to limit the
time spent trying to generate new trials, and there was _max_depth in
reserve_trial limiting the number of times a reservation would be
attempted and producer.produce() would be called. This led to misleading
error messages. For instance, with many workers, a worker could be
repeatedly unable to reserve a trial because each time it executed
producer.produce(), all other workers would reserve the trials
before the current worker had time to reserve one. In such a scenario the
error message would say that the algorithm was unable to sample new points
and was waiting for trials to complete. That is not true. We should state
the number of trials that were generated during these reservation
attempts and recommend increasing the pool-size and timeout.

How:

Producer.produce now attempts to produce pool-size trials only once
(calling algo.suggest only once) and returns the number of successfully
produced trials. All of the patience is moved to reserve_trial, which
alternates between reserving and producing until it reaches the timeout,
at which point a helpful error message is raised.
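
The control flow described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the code from this PR: the
`reserve` and `produce` callables, the `ReservationTimeout` exception, the
`clock` parameter and the wording of the error message are all hypothetical
stand-ins for the real experiment and producer objects.

```python
import time


class ReservationTimeout(Exception):
    """Hypothetical error raised when no trial could be reserved in time."""


def reserve_trial(reserve, produce, timeout=60.0, pool_size=1, clock=time.monotonic):
    """Alternate between reserving and producing until `timeout` is reached.

    `reserve()` returns a trial, or None if none is available.
    `produce(pool_size)` makes a single attempt at producing trials and
    returns how many it actually produced (both callables are assumptions).
    """
    start = clock()
    n_produced = 0  # total trials generated during the reservation attempts
    n_attempts = 0  # number of calls to produce()

    while True:
        trial = reserve()
        if trial is not None:
            return trial

        # The timeout is the single level of patience.
        if clock() - start >= timeout:
            break

        # One production attempt; other workers may grab these trials
        # before the next call to reserve().
        n_produced += produce(pool_size)
        n_attempts += 1

    raise ReservationTimeout(
        f"Unable to reserve a trial within {timeout} seconds. "
        f"{n_produced} trials were generated over {n_attempts} produce attempts "
        "but were reserved by other workers. "
        "Consider increasing the pool size or the timeout."
    )
```

Because the count of generated trials is accumulated in the same loop that
handles the timeout, the error can report what actually happened instead of
claiming that the algorithm could not sample. A real implementation would
likely also sleep or back off between attempts; that is omitted here.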

bouthilx added this to the v0.2 milestone Nov 23, 2021

bouthilx added this to In progress in Release v0.2.0 via automation Nov 23, 2021

bouthilx force-pushed the feature/improved_producer_resiliency branch 4 times, most recently from 91be352 to 8bcfd85 Compare November 23, 2021 17:13

bouthilx force-pushed the feature/improved_producer_resiliency branch from 8bcfd85 to 6720033 Compare November 23, 2021 19:08

bouthilx merged commit 9ca27cf into Epistimio:develop Nov 24, 2021

Release v0.2.0 automation moved this from In progress to Done Nov 24, 2021

bouthilx mentioned this pull request Nov 24, 2021

Release v0.2.0rc1 #695

Merged

bouthilx added the enhancement Improves a feature or non-functional aspects (e.g., optimization, prettify, technical debt) label Nov 24, 2021
