WIP: Parallel Trial Scheduler #971
Draft
bpkroth wants to merge 44 commits into microsoft:main from bpkroth:parallel-scheduler-fork
for more information, see https://pre-commit.ci
…to parallel_schedular
Co-authored-by: Brian Kroth <bpkroth@users.noreply.github.com>
This is part of an attempt to work around `multiprocessing.Pool` needing to pickle certain objects when forking. For instance, if the Environment uses an SshServer, we need to start an EventLoopContext in the background to handle the SSH connections, and threads are not picklable. Neither are file handles, DB connections, etc., so there may be other things we also need to adjust to make this work. See also microsoft#967.
…parallel-scheduler-fork
This was referenced May 9, 2025
bpkroth added a commit that referenced this pull request May 12, 2025
# Pull Request

## Title

Delay entering `TrialRunner` context until `run_trial`.

## Description

This is part of an attempt to work around `multiprocessing.Pool` needing to pickle certain objects when forking.

For instance, if the Environment uses an SshServer, we need to start an EventLoopContext in the background to handle the SSH connections, and threads are not picklable. Neither are file handles, DB connections, etc., so there may be other things we also need to adjust to make this work.

See also #967.

## Type of Change

- 🛠️ Bug fix
- 🔄 Refactor

## Testing

- Light so far (still in draft mode)
- Just basic existing CI tests (seems to not break anything)

## Additional Notes (optional)

I think this is incomplete. To support forking inside the Scheduler and *then* entering the context of the given TrialRunner, we may also need to do something about the Scheduler's Storage object.

That turned out to be true; those PRs are now forthcoming. See also #971.

For now this is a draft PR to allow @jsfreischuetz and me to play with alternative organizations of #967.
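To illustrate why delaying context entry matters for pickling, here is a minimal sketch. `TrialRunnerSketch` and `is_picklable` are hypothetical names invented for this example and are not the mlos_bench classes; the background thread merely stands in for something like an event loop started on context entry.

```python
import pickle
import threading


class TrialRunnerSketch:
    """Hypothetical stand-in for a TrialRunner; not the mlos_bench class."""

    def __init__(self, runner_id: int) -> None:
        self.runner_id = runner_id
        # Unpicklable resources are only created on context entry.
        self._stop = None
        self._worker = None

    def __enter__(self) -> "TrialRunnerSketch":
        # Stand-in for a background event loop: a thread that idles until told to stop.
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._stop.wait, daemon=True)
        self._worker.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Stop the thread and drop the unpicklable handles again.
        self._stop.set()
        self._worker.join()
        self._stop = None
        self._worker = None

    def run_trial(self, trial_id: int) -> str:
        # The context is entered here, i.e. after the runner has been shipped.
        with self:
            return f"runner {self.runner_id} ran trial {trial_id}"


def is_picklable(obj) -> bool:
    """Return True if pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


runner = TrialRunnerSketch(runner_id=1)
picklable_before = is_picklable(runner)  # no thread yet
with runner:
    picklable_inside = is_picklable(runner)  # thread and event are live

# Ship a copy (as a process pool would) and only then enter the context.
shipped = pickle.loads(pickle.dumps(runner))
outcome = shipped.run_trial(7)
```

In this sketch the runner can be pickled before entering its context but not while the thread is alive, which is the ordering problem the PR description refers to.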
motus pushed a commit that referenced this pull request May 12, 2025
# Pull Request

## Title

Rename and reorganize some Scheduler methods

## Description

In preparation for the ParallelScheduler PR (#971), this PR:

1. Renames a few methods for clarity (e.g., `schedule_trial` --> `add_trial_to_queue`) so that the role of other methods (e.g., `assign_trial_runners`) is clearer and more readily overridable.
2. Moves some methods to the base class. This is based on the observation that some of the methods (e.g., the main `start` loop) are largely reusable across both SyncScheduler and ParallelScheduler if the method separation is a little more fine-grained, so that each can override only the parts it needs to.

Additionally, it:

1. Adds a subtle tweak to the `pending_trials` method to filter based on whether or not the trial already has a runner assigned.
2. Adjusts `TrialRunner.run_trial` to return the results of the Trial. This isn't really used anywhere, but it helps to check that TrialRunners run in a child process finished successfully by using that result as a check in the return value, even though for Optimizer bulk_registering it actually pulls the results back off of Storage in the MainProcess.

## Type of Change

- 🔄 Refactor
- 🧪 Tests

## Testing

- [x] New CI tests for tweaks in `pending_trials`
- Existing CI tests

## Additional Notes (optional)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
motus added a commit that referenced this pull request May 20, 2025
…es. (#979)

# Pull Request

## Title

Refactor Scheduler schema definitions to make it easier to add new ones.

## Description

Simple refactor of the Scheduler schemas to allow making new ones as a copy/edit of the `synchscheduler-subschema.json`.

## Type of Change

- 🔄 Refactor

## Testing

- Existing CI checks for good and bad scheduler config files still pass.
- Adding additional tests to actually load those configs with `mlos_bench` next.

## Additional Notes (optional)

- Splitting some new testing infra out from #973.
- Adding `MockScheduler` next, then `ParallelScheduler` in #971.

---------

Co-authored-by: Sergiy Matusevych <sergiym@microsoft.com>
Pull Request
Title
Parallel Trial Scheduler Implementation
Description
Work towards parallel trial execution.
Uses `multiprocessing.Pool` to run trials in `TrialRunner`s in parallel.

Requires a few other changes to make that possible via pickling objects (to be split out to separate PRs):

- `Storage` objects made picklable by dropping the current engine connection and recreating it via `__getstate__` and `__setstate__`
- `Trial` object to be reconstructed from `Storage` via `exp_id`, `trial_id`
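The connection-dropping approach can be sketched as follows. This is only an illustration under stated assumptions: `SketchStorage` is a hypothetical class, and a `sqlite3` connection stands in for the real engine connection, which this page does not show.

```python
import pickle
import sqlite3


class SketchStorage:
    """Hypothetical stand-in for a Storage class; not the mlos_bench API."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path
        # A live connection handle cannot be pickled.
        self._conn = sqlite3.connect(db_path)

    def __getstate__(self) -> dict:
        # Copy the instance state and drop the unpicklable connection.
        state = self.__dict__.copy()
        state["_conn"] = None
        return state

    def __setstate__(self, state: dict) -> None:
        # Restore the plain attributes, then reconnect in the receiving process.
        self.__dict__.update(state)
        self._conn = sqlite3.connect(self.db_path)

    def is_connected(self) -> bool:
        return self._conn is not None


storage = SketchStorage(":memory:")
# Round-trip through pickle, as multiprocessing would when sending the object.
clone = pickle.loads(pickle.dumps(storage))
```

Note that with `":memory:"` each new connection opens a separate empty database, so this only demonstrates the reconnect mechanics; a shared backing store is assumed in any real use.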
For better understanding and reuse, refactors some things:

- (`schedule` --> `add_to_queue`, vs. `assign_trial_runners`): Rename and reorganize some Scheduler methods #975
- `wait_for_trial_runners()` method that is a no-op in the base class, but handles asynchronous runners in the `ParallelTrialScheduler`
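A rough sketch of the `wait_for_trial_runners()` idea follows. All class names, `run_trials`, `run_schedule`, and the module-level `run_trial` function are hypothetical. `ThreadPool` is swapped in for `multiprocessing.Pool` purely so the snippet runs without a `__main__` guard; it exposes the same `apply_async` interface.

```python
from multiprocessing.pool import ThreadPool


def run_trial(trial_id: int) -> int:
    """Hypothetical trial body: returns the id to signal success."""
    return trial_id


class BaseSchedulerSketch:
    """Runs trials synchronously; waiting is a no-op."""

    def __init__(self) -> None:
        self.finished: list[int] = []

    def run_trials(self, trial_ids: list[int]) -> None:
        for trial_id in trial_ids:
            self.finished.append(run_trial(trial_id))

    def wait_for_trial_runners(self) -> None:
        """No-op: synchronous trials are done once run_trials returns."""

    def run_schedule(self, trial_ids: list[int]) -> list[int]:
        # Shared loop body: subclasses override only the parts they need.
        self.run_trials(trial_ids)
        self.wait_for_trial_runners()
        return sorted(self.finished)


class ParallelSchedulerSketch(BaseSchedulerSketch):
    """Submits trials to a pool and collects results when waiting."""

    def __init__(self, pool) -> None:
        super().__init__()
        self._pool = pool
        self._pending = []

    def run_trials(self, trial_ids: list[int]) -> None:
        # Submit without blocking; results are gathered later.
        for trial_id in trial_ids:
            self._pending.append(self._pool.apply_async(run_trial, (trial_id,)))

    def wait_for_trial_runners(self) -> None:
        # Block until every submitted trial has returned its result.
        for result in self._pending:
            self.finished.append(result.get(timeout=30))
        self._pending.clear()


sync_result = BaseSchedulerSketch().run_schedule([3, 1, 2])
with ThreadPool(processes=2) as pool:
    parallel_result = ParallelSchedulerSketch(pool).run_schedule([3, 1, 2])
```

Keeping the loop in the base class and overriding only submission and waiting mirrors the method separation described in #975, though the actual method boundaries in the PR may differ.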
Needs lots of tests
Closes #380
Type of Change
Testing
Additional Notes (optional)
This is meant as a draft for now to share the design ideas.
It builds off of #939 and #967 and #970.