Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PBT issues with working dir and promotion of max fidelity trials #903

Merged
merged 99 commits into from
Aug 2, 2022

Conversation

bouthilx
Copy link
Member

@bouthilx bouthilx commented May 3, 2022

@Delaunay @lebrice This includes a fugly fix to working dir paths of trials that depends on the trial id which is different in transformed space and original space. To makes things worse, the ID of the trial depends on the experiment ID which is only set outsite of the algorithm by the experiment object after the trial was suggested. Because of this we need to rename directories or create sym-links at the different levels (In algo, in algo-wrapper, in experiment) to make sure that the working dir of the trial is corresponding to the one communicated to the user through trial.working_dir or env var ORION_WORKING_DIR.

We could get rid of 1 level by making the trial id not depend on experiment. In the database the index of trials should depend on trial.id and trial.experiment however to make sure trials are not duplicated within a given experiment. That would require updates of database from the side of users after next release however.

I don't see a simple and clean fix however for the different working_dir inside the algorithm and inside the algo-wrapper. The parameters are different and thus the ID and working_dir is different.

Would you have any ideas?

@bouthilx bouthilx added bug Indicates an unexpected problem or unintended behavior high The bug makes a feature unusable labels May 3, 2022
Copy link
Collaborator

@lebrice lebrice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize the discussion we had on Slack:

The PBT algorithm shouldn deal with copying / creating directories for trials, since it doesn't have any way of knowing which trials are actually new.

This responsibility should be moved to the Experiment, which would use the lineage information and the experiment id to create the working directory of the trial before it is passed to the User, and copy the files as necessary.

src/orion/core/worker/primary_algo.py Outdated Show resolved Hide resolved
tests/unittests/algo/pbt/test_pbt.py Outdated Show resolved Hide resolved
@bouthilx
Copy link
Member Author

Actually it should be the Runner's responsibility considering coming changes with #911. The trials will be generated on a remote server and thus the directory creation/copy should not occur during the trial generation on the server but rather in the Runner which is assumed to share the same file system as workers.

If the list of choices contains multiple types, numpy's RNG will cast
all values to the type of the first item.
Why:

When the trial has a parent, it means it should start from the same
working_dir state. We need to fetch the parent trial and copy its working
dir to the current trial's working dir.

How:

The runner now has a new argument, a callable that will be executed
before a trial is submitted to the executor. By default this callable
will take care of copying parent trial working_dir to child trial
working dir, or simply create the working dir if the trial has no
parent.

If the callable fails, the exception is caught and delayed to be caught
again during the executor's execution of the trial. This makes it
possible to handle these trials the same way as if they crashed during
trial execution rather than during trial preparation.
Why:

The parent id is in transformed space. When converting a trial from
transformed space to original space, the id of the parent must be
converted too. If the parent has a parent too, then this grand-parent
must be converted too, and so on.

How:

Recursively get to the root trial, and then compute the trial ids down
to last parent trial.

Tests are added to verify this.
Why:

The number of trials executed `self.trials` is only incremented when
trials are completed successfully, not when they crash. Thus, computing
`self.trials - self.worker_broken_trials` would not count properly the
number of completed trials by this runner.
Copy link
Collaborator

@lebrice lebrice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments.

src/orion/client/experiment.py Outdated Show resolved Hide resolved
src/orion/client/runner.py Show resolved Hide resolved
src/orion/client/runner.py Outdated Show resolved Hide resolved
src/orion/core/worker/primary_algo.py Outdated Show resolved Hide resolved
src/orion/core/worker/primary_algo.py Outdated Show resolved Hide resolved
tests/unittests/client/test_runner.py Outdated Show resolved Hide resolved
tests/unittests/client/test_runner.py Show resolved Hide resolved
tests/unittests/client/test_runner.py Outdated Show resolved Hide resolved
tests/unittests/client/test_runner.py Outdated Show resolved Hide resolved
tests/unittests/core/test_primary_algo.py Outdated Show resolved Hide resolved
bouthilx and others added 11 commits June 23, 2022 14:17
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
It should not assume that algorithms are able to sample 2 trials at the
same time. Therefore we won't test that it returns different trials one
after the other. Only that seeding gives the same trial. The convergence
test will figure it out if the algorithm always return the same trial.
src/orion/algo/axoptimizer.py Outdated Show resolved Hide resolved
fidelity_dim: Fidelity = self.space[self.fidelity_index]
orion_params[self.fidelity_index] = fidelity_dim.high
fidelity_dim = self.space[self.fidelity_index]
assert isinstance(fidelity_dim, TransformedDimension)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

src/orion/algo/hyperband.py Outdated Show resolved Hide resolved
src/orion/algo/pbt/pb2.py Show resolved Hide resolved
src/orion/client/runner.py Outdated Show resolved Hide resolved
src/orion/client/runner.py Outdated Show resolved Hide resolved
src/orion/core/utils/random_state.py Outdated Show resolved Hide resolved
src/orion/core/utils/random_state.py Outdated Show resolved Hide resolved
src/orion/core/utils/random_state.py Outdated Show resolved Hide resolved
tests/functional/algos/test_algos.py Outdated Show resolved Hide resolved
bouthilx and others added 7 commits July 26, 2022 11:38
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
Co-authored-by: Fabrice Normandin <fabrice.normandin@gmail.com>
src/orion/core/utils/random_state.py Outdated Show resolved Hide resolved
src/orion/core/utils/random_state.py Outdated Show resolved Hide resolved
It was previously removed in PR Epistimio#903 because the Consumer is not
supposed to create the directory anymore. Merging latest develop branch
reintroduced it.
Pass algo object instead so that random state can be updated at end of
with-clause.
tests/unittests/core/worker/test_consumer.py Show resolved Hide resolved
src/orion/core/utils/random_state.py Outdated Show resolved Hide resolved
@bouthilx bouthilx merged commit ba4ce25 into Epistimio:develop Aug 2, 2022
@notoraptor notoraptor mentioned this pull request Aug 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior high The bug makes a feature unusable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants