Fix PBT issues with working dir and promotion of max fidelity trials #903

bouthilx · 2022-05-03T18:56:26Z

@Delaunay @lebrice This includes a fugly fix to the working dir paths of trials, which depend on the trial id, which is different in transformed space and original space. To make things worse, the ID of the trial depends on the experiment ID, which is only set outside of the algorithm by the experiment object after the trial was suggested. Because of this we need to rename directories or create symlinks at the different levels (in the algo, in the algo-wrapper, in the experiment) to make sure that the working dir of the trial corresponds to the one communicated to the user through trial.working_dir or the env var ORION_WORKING_DIR.

We could get rid of one level by making the trial id not depend on the experiment. In the database, the index of trials should still depend on both trial.id and trial.experiment to make sure trials are not duplicated within a given experiment. That would require users to update their databases after the next release, however.

I don't see a simple and clean fix, however, for the different working_dir inside the algorithm and inside the algo-wrapper. The parameters are different and thus the ID and working_dir are different.
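To make the mismatch concrete, here is a minimal sketch. It assumes the id is a hash of the parameters and the experiment id; `sketch_trial_id`, the `lr` parameter, the log transformation and the `/tmp/exp` root are all hypothetical and are not Orion's actual implementation.

```python
import hashlib
import json
import math
import os


def sketch_trial_id(params, experiment_id):
    """Hypothetical id: a hash of the parameters and the experiment id."""
    payload = json.dumps(
        {"params": params, "experiment": experiment_id}, sort_keys=True
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# The same trial seen in original space and in a (hypothetical) log-transformed space.
original_params = {"lr": 0.01}
transformed_params = {"lr": math.log(original_params["lr"])}

id_original = sketch_trial_id(original_params, "exp-1")
id_transformed = sketch_trial_id(transformed_params, "exp-1")

# Before the experiment sets its id, the algorithm would compute yet another hash.
id_without_experiment = sketch_trial_id(original_params, None)

# A working dir derived from the id therefore differs at each level.
dir_original = os.path.join("/tmp/exp", id_original)
dir_transformed = os.path.join("/tmp/exp", id_transformed)
```

Under these assumptions the three ids differ, which is why a directory created from one of them has to be renamed or sym-linked before the user sees the trial.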

Would you have any ideas?

lebrice

To summarize the discussion we had on Slack:

The PBT algorithm shouldn't deal with copying / creating directories for trials, since it doesn't have any way of knowing which trials are actually new.

This responsibility should be moved to the Experiment, which would use the lineage information and the experiment id to create the working directory of the trial before it is passed to the User, and copy the files as necessary.

src/orion/core/worker/primary_algo.py

tests/unittests/algo/pbt/test_pbt.py

bouthilx · 2022-05-10T18:12:20Z

Actually it should be the Runner's responsibility, considering the coming changes with #911. The trials will be generated on a remote server, and thus the directory creation/copy should not occur during the trial generation on the server but rather in the Runner, which is assumed to share the same file system as the workers.


Why: When the trial has a parent, it means it should start from the same working_dir state. We need to fetch the parent trial and copy its working dir to the current trial's working dir. How: The runner now has a new argument, a callable that will be executed before a trial is submitted to the executor. By default this callable will take care of copying parent trial working_dir to child trial working dir, or simply create the working dir if the trial has no parent. If the callable fails, the exception is caught and delayed to be caught again during the executor's execution of the trial. This makes it possible to handle these trials the same way as if they crashed during trial execution rather than during trial preparation.
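A rough sketch of that flow, under stated assumptions: `SketchTrial`, `prepare_working_dir` and `submit_trial` are hypothetical stand-ins, not the Runner's real classes or signatures. The preparation callable runs before submission, and any exception it raises is stored and re-raised inside the task the executor would run.

```python
import os
import shutil
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTrial:
    """Hypothetical stand-in for a trial: a working dir and an optional parent."""

    working_dir: str
    parent: Optional["SketchTrial"] = None


def prepare_working_dir(trial):
    """Copy the parent's working dir, or create an empty one if there is no parent."""
    if trial.parent is None:
        os.makedirs(trial.working_dir, exist_ok=True)
    else:
        shutil.copytree(trial.parent.working_dir, trial.working_dir)


def submit_trial(trial, run, prepare=prepare_working_dir):
    """Run `prepare` now; return a task that re-raises any preparation error."""
    error = None
    try:
        prepare(trial)
    except Exception as exc:  # delayed on purpose, raised again inside the task
        error = exc

    def task():
        if error is not None:
            raise error
        return run(trial)

    return task
```

Because the error only surfaces when `task()` is called, a failed preparation would go through the same handling path as a trial that crashed during execution.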

Why: The parent id is in transformed space. When converting a trial from transformed space to original space, the id of the parent must be converted too. If the parent has a parent too, then this grand-parent must be converted too, and so on. How: Recursively get to the root trial, and then compute the trial ids down to last parent trial. Tests are added to verify this.
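The recursion can be sketched as follows. This is an illustration only: trials are plain dicts, `sketch_id` is a hypothetical hash of the parameters and the parent id, and `reverse` is a made-up reverse transformation; none of these names come from the code base.

```python
import hashlib
import json


def sketch_id(params, parent_id):
    """Hypothetical id: a hash of the parameters and the parent id."""
    payload = json.dumps({"params": params, "parent": parent_id}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def to_original(trial, registry, reverse):
    """Convert a transformed trial, converting its ancestors first."""
    parent_id = None
    if trial["parent"] is not None:
        # Recurse up to the root, then compute ids on the way back down.
        parent = registry[trial["parent"]]
        parent_id = to_original(parent, registry, reverse)["id"]
    params = reverse(trial["params"])
    return {"params": params, "parent": parent_id, "id": sketch_id(params, parent_id)}


def make_trial(params, parent_id, registry):
    """Create a transformed-space trial and register it under its id."""
    trial = {"params": params, "parent": parent_id, "id": sketch_id(params, parent_id)}
    registry[trial["id"]] = trial
    return trial


def reverse(params):
    """Hypothetical reverse transformation back to original space."""
    return {name: value * 10 for name, value in params.items()}


registry = {}
root = make_trial({"x": 1}, None, registry)
child = make_trial({"x": 2}, root["id"], registry)
grandchild = make_trial({"x": 3}, child["id"], registry)

original = to_original(grandchild, registry, reverse)
```

In this sketch `original["parent"]` is the id of the converted child, which itself depends on the id of the converted root, so the whole lineage has to be walked.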

Why: The number of trials executed `self.trials` is only incremented when trials are completed successfully, not when they crash. Thus, computing `self.trials - self.worker_broken_trials` would not count properly the number of completed trials by this runner.

lebrice

Left some minor comments.

src/orion/client/experiment.py

src/orion/client/runner.py

src/orion/core/worker/primary_algo.py

tests/unittests/client/test_runner.py

tests/unittests/core/test_primary_algo.py


It should not assume that algorithms are able to sample 2 trials at the same time. Therefore we won't test that it returns different trials one after the other, only that seeding gives the same trial. The convergence test will figure it out if the algorithm always returns the same trial.

src/orion/algo/axoptimizer.py

lebrice · 2022-07-25T20:50:44Z

src/orion/algo/hebo/hebo_algo.py

-            fidelity_dim: Fidelity = self.space[self.fidelity_index]
-            orion_params[self.fidelity_index] = fidelity_dim.high
+            fidelity_dim = self.space[self.fidelity_index]
+            assert isinstance(fidelity_dim, TransformedDimension)


src/orion/algo/hyperband.py

src/orion/algo/pbt/pb2.py

src/orion/client/runner.py

src/orion/core/utils/random_state.py

tests/functional/algos/test_algos.py


bouthilx added the labels bug (Indicates an unexpected problem or unintended behavior) and high (The bug makes a feature unusable) May 3, 2022

lebrice reviewed May 10, 2022


src/orion/core/worker/primary_algo.py (outdated, resolved)

tests/unittests/algo/pbt/test_pbt.py (outdated, resolved)

Delaunay approved these changes May 10, 2022


bouthilx added 10 commits June 21, 2022 15:21

Use randint instead of choice in PBT.explore

b061b91

If the list of choices contains multiple types, numpy's RNG will cast all values to the type of the first item.
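A small illustration of the problem, assuming a hypothetical list of mixed-type choices. NumPy converts the list to a single-dtype array before sampling, so the sampled value no longer has its original Python type, whereas sampling an index with `randint` and indexing the list returns the original object.

```python
import numpy as np

rng = np.random.RandomState(1)

# Hypothetical categorical choices with mixed types.
choices = [2, 0.5, "adam"]

# `choice` first builds an array from the list, which forces a common dtype.
picked_by_choice = rng.choice(choices)

# Sampling an index instead leaves the list items untouched.
index = rng.randint(len(choices))
picked_by_index = choices[index]

print(type(picked_by_choice).__name__)
print(type(picked_by_index).__name__)
```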

Do not promote trials at last fidelity in PBT

7729cef

Copy experiment when branching trial

56ef0d4

Fugly fix for trial working dir

c999116

isort

2c3ea09

Blackify

de0fe44

Adapt PBT to new way of trial working_dir copy

f47850c

bouthilx force-pushed the hotfix/pbt_cp_dir branch from 806b722 to 98b08a1 Compare June 22, 2022 22:01

bouthilx added 2 commits June 22, 2022 18:10

pylint

04eb5e8

Adjust functional tests for pbt and pb2

0d06f22

lebrice reviewed Jun 23, 2022


bouthilx and others added 11 commits June 23, 2022 14:17

Fix ExperimentClient.workon typing

c357931

Add typing to Runner.__init__

5c150f3

Remove unnecessary is not None

283b87b

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Add type hints to get_original_parent

5113bb4

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Add type hints to branching_rosenbrock

d6d3683

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Add type hints to test_branching_algos

5167330

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Remove unnecessary OrionState

ca6a7cd

Merge branch 'hotfix/pbt_cp_dir' of github.com:bouthilx/orion into ho…

c34d3e8

…tfix/pbt_cp_dir

Add type hints to test_fidelity_upgrades

c05eb25

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Reenable PBT tests

b06ac03

Merge branch 'hotfix/pbt_cp_dir' of github.com:bouthilx/orion into ho…

75dfb72

…tfix/pbt_cp_dir

bouthilx added 4 commits July 15, 2022 15:14

Adjust tests of PBT because generate_offsprings does not raise anymore

1d970e8

Adjust HEBO fixture to new test-suite interface

dab7dc1

Adjust suggest_n for HEBO

a593d7c

lebrice approved these changes Jul 25, 2022


bouthilx and others added 7 commits July 26, 2022 11:38

Swap parent classes of Hyperband

5b9b340

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Add type hint

9d5cf4e

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Add type hints to prepare_trial_working_dir

fbed0a3

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Add type hints

23bea94

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Remove deprecated note in docstring

dbd4ac2

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Add type hints

a4790a5

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Fix style

0e702f0

lebrice approved these changes Jul 27, 2022


src/orion/core/utils/random_state.py (outdated, resolved)

src/orion/core/utils/random_state.py (outdated, resolved)

bouthilx added 3 commits July 29, 2022 11:10

Merge branch 'develop' into hotfix/pbt_cp_dir

a885b0d

Remove test_trial_working_dir_is_created

cf886dc

It was previously removed in PR Epistimio#903 because the Consumer is not supposed to create the directory anymore. Merging latest develop branch reintroduced it.

Fix control_randomness

d3e4935

Pass algo object instead so that random state can be updated at end of with-clause.
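The idea can be sketched with Python's standard `random` module. `SketchAlgo` and `control_randomness_sketch` are hypothetical names, and the real helper in `random_state.py` may differ. The point is that the context manager needs the object itself, not just a copy of its state, to be able to write the advanced state back when the with-clause ends.

```python
import contextlib
import random


class SketchAlgo:
    """Hypothetical algorithm that only carries a random state."""

    def __init__(self, seed):
        self.random_state = random.Random(seed).getstate()


@contextlib.contextmanager
def control_randomness_sketch(algo):
    """Use the algo's random state inside the block, then save it back on the algo."""
    external_state = random.getstate()
    random.setstate(algo.random_state)
    try:
        yield
    finally:
        # Only possible because we hold the algo object, not a bare state.
        algo.random_state = random.getstate()
        random.setstate(external_state)


algo = SketchAlgo(seed=1)
with control_randomness_sketch(algo):
    first = random.random()
with control_randomness_sketch(algo):
    second = random.random()
```

Since the state is stored back on `algo` at exit, the second block continues the sequence instead of replaying the first draw.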

lebrice approved these changes Jul 29, 2022


tests/unittests/core/worker/test_consumer.py (resolved)

src/orion/core/utils/random_state.py (outdated, resolved)

bouthilx added 3 commits July 29, 2022 15:37

Appease pylint for Protocol classes...

e7d5b92

Rename Algo protocol to make it more generic

a39c818

Using Protocol from typing_ext for py37

4c94dd2

lebrice reviewed Jul 29, 2022


src/orion/core/utils/random_state.py (resolved)

lebrice reviewed Jul 29, 2022


src/orion/core/utils/random_state.py (outdated, resolved)

bouthilx and others added 6 commits July 29, 2022 20:12

PB2 does not raise when converging, just like PBT

7ef65b3

Update src/orion/core/utils/random_state.py

0a67e66

Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>

Remove useless asserts

7230143

Merge branch 'hotfix/pbt_cp_dir' of github.com:bouthilx/orion into ho…

e3e27d6

…tfix/pbt_cp_dir

Force pymoo dep to 0.5.0 for HEBO

23bd733

Since release v0.6.0 of pymoo (https://github.com/anyoptimization/pymoo/releases/tag/0.6.0), HEBO's acq_optimizers fail to import because of an ImportError during the module's own imports.

Make test robust to nb of trials

ec910bd

bouthilx merged commit ba4ce25 into Epistimio:develop Aug 2, 2022

notoraptor mentioned this pull request Aug 2, 2022

Release 0.2.5rc #980

Merged
