WIP: Parallel Trial Scheduler #971

bpkroth · 2025-05-02T23:23:53Z

Pull Request

Title

Parallel Trial Scheduler Implementation

Description

Work towards parallel trial execution.

Uses multiprocessing.Pool to run trials in TrialRunners in parallel.
Requires a few other changes to make that possible via pickling objects:
(to be split out to separate PRs)
- Make Storage objects picklable by dropping the current engine connection and recreating it via __getstate__ and __setstate__
- Allows a Trial object to be reconstructed from Storage via exp_id, trial_id
- See Make Storage objects picklable and resumable by mp.Pool workers #974
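To illustrate the pickling approach described above, here is a minimal sketch, not the actual mlos_bench code. `PicklableStorage`, `_db_path`, and `is_connected` are hypothetical names, and `sqlite3` stands in for the real engine connection.

```python
import pickle
import sqlite3


class PicklableStorage:
    """Hypothetical stand-in for a Storage object that owns a live DB connection."""

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        # Live connections cannot be pickled, so they must not travel to workers.
        self._conn = sqlite3.connect(db_path)

    def __getstate__(self) -> dict:
        # Copy the instance dict and drop the unpicklable connection.
        state = self.__dict__.copy()
        state.pop("_conn", None)
        return state

    def __setstate__(self, state: dict) -> None:
        # Restore the plain attributes, then recreate the connection.
        self.__dict__.update(state)
        self._conn = sqlite3.connect(self._db_path)

    def is_connected(self) -> bool:
        return self._conn.execute("SELECT 1").fetchone() == (1,)


storage = PicklableStorage(":memory:")
# Round-trip through pickle, as mp.Pool would when shipping the object to a worker.
clone = pickle.loads(pickle.dumps(storage))
```

The clone carries the same configuration but owns a fresh connection, which is the property a pool worker would need.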
For better understanding and reuse, refactors some things:
- Rename some functions for clarity (e.g., schedule --> add_to_queue, vs. assign_trial_runners): Rename and reorganize some Scheduler methods #975
- Moves some common functions to the base class: Rename and reorganize some Scheduler methods #975
- Adds a new wait_for_trial_runners() method that is a no-op in the base class, but handles asynchronous runners in the ParallelTrialScheduler
- Adjusts some of the Trial results re-loading logic to be able to handle out-of-order trial executions: WIP: Prepare Experiment.load to handle async out of order trial completion #973
- Adds some vscode settings for new copilot instructions and the start of some custom prompts: Adding VSCode Copilot integrations #972
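The `wait_for_trial_runners()` split described above could look roughly like the following sketch. All class and function names other than `wait_for_trial_runners` are hypothetical. The demo uses `multiprocessing.pool.ThreadPool`, which shares the `apply_async` API of `multiprocessing.Pool`, only so the snippet runs without a `__main__` guard; the PR itself targets process-based pools.

```python
from multiprocessing.pool import AsyncResult, ThreadPool


def run_trial(trial_id: int) -> dict:
    """Hypothetical trial body; returns a small result dict."""
    return {"trial_id": trial_id, "status": "SUCCEEDED"}


class BaseScheduler:
    """Hypothetical base class: trials run inline, so there is nothing to wait for."""

    def __init__(self) -> None:
        self.results: list[dict] = []

    def run_trial(self, trial_id: int) -> None:
        self.results.append(run_trial(trial_id))

    def wait_for_trial_runners(self) -> None:
        """No-op: synchronous trials are already finished when run_trial returns."""


class ParallelScheduler(BaseScheduler):
    """Hypothetical subclass: submits trials to a pool and collects them later."""

    def __init__(self, pool: ThreadPool) -> None:
        super().__init__()
        self._pool = pool
        self._pending: list[AsyncResult] = []

    def run_trial(self, trial_id: int) -> None:
        # Submit without blocking; the result is gathered in wait_for_trial_runners.
        self._pending.append(self._pool.apply_async(run_trial, (trial_id,)))

    def wait_for_trial_runners(self) -> None:
        # Block until every outstanding trial has returned its result.
        for async_result in self._pending:
            self.results.append(async_result.get())
        self._pending.clear()


with ThreadPool(processes=2) as pool:
    scheduler = ParallelScheduler(pool)
    for trial_id in range(4):
        scheduler.run_trial(trial_id)
    scheduler.wait_for_trial_runners()
```

Because the base class method is a no-op, a shared main loop can call `wait_for_trial_runners()` unconditionally and only the parallel variant pays for it.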
Needs lots of tests

Closes #380

Type of Change

✨ New feature
⚠️ Breaking change
🔄 Refactor
📝 Documentation update
🧪 Tests

Testing

TODO

Additional Notes (optional)

This is meant as a draft for now to share the design ideas.
It builds off of #939 and #967 and #970.



# Pull Request

## Title

Delay entering `TrialRunner` context until `run_trial`.

## Description

This is part of an attempt to see if we can work around issues with `multiprocessing.Pool` needing to pickle certain objects when forking.

For instance, if the Environment is using an SshServer, we need to start an EventLoopContext in the background to handle the SSH connections, and threads are not picklable. Nor are file handles, DB connections, etc., so there may be other things we also need to adjust to make this work.

See Also #967

## Type of Change

- 🛠️ Bug fix
- 🔄 Refactor

## Testing

- Light so far (still in draft mode)
- Just basic existing CI tests (seems to not break anything)

## Additional Notes (optional)

I think this is incomplete. To support forking inside the Scheduler and *then* entering the context of the given TrialRunner, we may also need to do something about the Scheduler's Storage object.

That was true; those PRs are now forthcoming. See Also #971

For now this is a draft PR to allow @jsfreischuetz and me to play with alternative organizations of #967.
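A minimal sketch of why delaying context entry helps, under stated assumptions: `LazyTrialRunner` is a hypothetical class, and a `threading.Lock` stands in for the unpicklable state (event loop thread, SSH session, file handle) that entering the context would create.

```python
import pickle
import threading


class LazyTrialRunner:
    """Hypothetical runner that creates unpicklable state only inside run_trial."""

    def __init__(self, runner_id: int) -> None:
        self.runner_id = runner_id
        self._lock = None  # placeholder for a thread, event loop, SSH session, ...

    def __enter__(self) -> "LazyTrialRunner":
        # Entering the context creates a resource that pickle cannot serialize.
        self._lock = threading.Lock()
        return self

    def __exit__(self, *exc_info) -> None:
        self._lock = None

    def run_trial(self, trial_id: int) -> dict:
        # The context is entered here, i.e. after the object reached the worker.
        with self:
            assert self._lock is not None
            return {"runner_id": self.runner_id, "trial_id": trial_id}


runner = LazyTrialRunner(runner_id=0)

# Before the context is entered, the runner pickles cleanly.
restored = pickle.loads(pickle.dumps(runner))
result = restored.run_trial(trial_id=7)

# Inside the context, pickling fails because of the live lock.
with runner:
    try:
        pickle.dumps(runner)
        picklable_inside_context = True
    except TypeError:
        picklable_inside_context = False
```

The same ordering would apply to a pool worker: ship the plain object first, then enter its context on the other side.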

# Pull Request

## Title

Rename and reorganize some Scheduler methods

## Description

In preparation for the ParallelScheduler PR (#971), this PR:

1. Renames a few methods for clarity (e.g., `schedule_trial` --> `add_trial_to_queue`) so that the role of other methods (e.g., `assign_trial_runners`) is clearer and more readily overridable.
2. Moves some methods to the base class.
   This is based on the observation that some of the methods (e.g., the main `start` loop) are largely reusable across both SyncScheduler and ParallelScheduler if the methods are separated at a slightly finer granularity, so that each subclass can override only the parts it needs to.

Additionally, it:

1. Adds a subtle tweak to the `pending_trials` method to filter based on whether or not the trial already has a runner assigned.
2. Adjusts `TrialRunner.run_trial` to return the results of the Trial.
   This isn't really used anywhere, but the return value is a helpful check that TrialRunners run in a child process finished successfully, even though for Optimizer bulk_registering the results are actually pulled back off of Storage in the MainProcess.

## Type of Change

- 🔄 Refactor
- 🧪 Tests

## Testing

- [x] New CI tests for tweaks in `pending_trials`
- Existing CI tests

## Additional Notes (optional)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
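The `pending_trials` tweak can be pictured with the sketch below. The `Trial` dataclass, the `runner_assigned` parameter, and the status strings are assumptions made for illustration; the real method signature in mlos_bench may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Trial:
    """Hypothetical, simplified trial record."""

    trial_id: int
    status: str = "PENDING"
    trial_runner_id: Optional[int] = None


def pending_trials(
    trials: Iterable[Trial],
    runner_assigned: Optional[bool] = None,
) -> Iterator[Trial]:
    """Yield pending trials, optionally filtered by whether a runner is assigned.

    runner_assigned=None applies no filter; True or False keeps only trials
    that do or do not already have a runner, respectively.
    """
    for trial in trials:
        if trial.status != "PENDING":
            continue
        has_runner = trial.trial_runner_id is not None
        if runner_assigned is None or runner_assigned == has_runner:
            yield trial


queue = [
    Trial(trial_id=1, trial_runner_id=0),
    Trial(trial_id=2),
    Trial(trial_id=3, status="SUCCEEDED", trial_runner_id=1),
    Trial(trial_id=4),
]

# Trials still waiting for a runner, e.g. input for assign_trial_runners.
unassigned_ids = [t.trial_id for t in pending_trials(queue, runner_assigned=False)]
# Trials that already have a runner and are ready to be run.
assigned_ids = [t.trial_id for t in pending_trials(queue, runner_assigned=True)]
```

Splitting the queue this way lets runner assignment and trial execution be overridden separately.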

…es. (#979)

# Pull Request

## Title

Refactor Scheduler schema definitions to make it easier to add new ones.

## Description

Simple refactor of the Scheduler schemas to allow making new ones as a copy/edit of the `synchscheduler-subschema.json`.

## Type of Change

- 🔄 Refactor

## Testing

- Existing CI checks for good and bad scheduler config files still pass.
- Adding additional tests to actually load those configs with `mlos_bench` next.

## Additional Notes (optional)

- Splitting some new testing infra out from #973.
- Adding `MockScheduler` next, then `ParallelScheduler` in #971.

---------

Co-authored-by: Sergiy Matusevych <sergiym@microsoft.com>

jsfreischuetz and others added 28 commits April 23, 2025 10:39

add parallel schedular

8b6e172

[pre-commit.ci] auto fixes from pre-commit.com hooks

1373ff1

for more information, see https://pre-commit.ci

Merge branch 'microsoft:main' into parallel_schedular

5ff16c3

alternative implementation for threads

b45c71e

[pre-commit.ci] auto fixes from pre-commit.com hooks

a326cdf


add comments

8d34686

Merge branch 'parallel_schedular' of github.com:jsfreischuetz/MLOS into parallel_schedular

8c29920

switch from threads to processes

4bbb534

[pre-commit.ci] auto fixes from pre-commit.com hooks

123ecd3


Update mlos_bench/mlos_bench/schedulers/parallel_scheduler.py

3357fcf

Co-authored-by: Brian Kroth <bpkroth@users.noreply.github.com>

Update mlos_bench/mlos_bench/storage/sql/experiment.py

7a65629

Co-authored-by: Brian Kroth <bpkroth@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

2c9ecb8


updates for comments

40c7f83

merge

1f9e466

fix linting errors

ee9e7e0

[pre-commit.ci] auto fixes from pre-commit.com hooks

d619999


try to fix the docs

3dfb7de

Merge branch 'parallel_schedular' of github.com:jsfreischuetz/MLOS into parallel_schedular

bb239e8

Merge remote-tracking branch 'jsfreischuetz/parallel_schedular' into parallel-scheduler-fork

73900b2

restore original trial runner

5ea30c9

remove scheduled state

4f56ec4

adding some todo and refactoring notes and comments

04b9266

prepping pickability of Storage

d5f59af

start some other consolidation of other functions

3cf25c9

start consolidating some things toward the base class

3a37e9b

support returning data from run_trial

f7fcecf

wip: refactoring and implementation for parallel scheduler

e8635d4

bpkroth added the WIP label May 2, 2025

[pre-commit.ci] auto fixes from pre-commit.com hooks

a8c3d61


bpkroth and others added 11 commits May 8, 2025 15:00

add copilot instructions customizations

c79df57

adding some custom prompts for github copilot to use

3815ed1

[pre-commit.ci] auto fixes from pre-commit.com hooks

f177f56


don't remove unused imports - it's annoying

b482564

Make sure to handle stragger trials when re-loading the storage

dd2faee

stubbing out 'wait_for_trial_runners' support in base class

24cdf0c

make to wait for all trial runners to finish

ff5fcc2

comments

30ee291

[pre-commit.ci] auto fixes from pre-commit.com hooks

2b21a0d


rework instructions a bit

50b9706

[pre-commit.ci] auto fixes from pre-commit.com hooks

816ada1


bpkroth mentioned this pull request May 9, 2025

Delay entering TrialRunner context until run_trial #970

Merged

bpkroth and others added 3 commits May 9, 2025 13:11

Adding new prompt

77424cb

update the prompt

06ab189

[pre-commit.ci] auto fixes from pre-commit.com hooks

073f9bf


This was referenced May 9, 2025

WIP: Prepare Experiment.load to handle async out of order trial completion #973

Draft

Rename and reorganize some Scheduler methods #975

Merged

Add a Parallel Task Schedular #967

Closed

Merge branch 'main' into parallel-scheduler-fork

ba25ecb

bpkroth mentioned this pull request May 19, 2025

Refactor Scheduler schema definitions to make it easier to add new ones. #979

Merged

bpkroth mentioned this pull request May 22, 2025

Introduce mock_trial_data for MockEnv and add some basic Scheduler tests #980

Draft

3 tasks
