Skip to content

Conversation

bpkroth
Copy link
Contributor

@bpkroth bpkroth commented May 2, 2025

Pull Request

Title

Parallel Trial Scheduler Implementation


Description

Work towards parallel trial execution.

Closes #380


Type of Change

  • ✨ New feature
  • ⚠️ Breaking change
  • 🔄 Refactor
  • 📝 Documentation update
  • 🧪 Tests

Testing

  • TODO

Additional Notes (optional)

This is meant as a draft for now to share the design ideas.
It builds off of #939 and #967 and #970.


jsfreischuetz and others added 28 commits April 23, 2025 10:39
Co-authored-by: Brian Kroth <bpkroth@users.noreply.github.com>
Co-authored-by: Brian Kroth <bpkroth@users.noreply.github.com>
This is part of an attempt to try and see if can work around issues with
`multiprocessing.Pool` needing to pickle certain objects when forking.

For instance, if the Environment is using an SshServer, we need to start an
EventLoopContext in the background to handle the SSH connections and threads are
not picklable.

Nor are file handles, DB connections, etc., so there may be other things we also
need to adjust to make this work.

See Also microsoft#967
@bpkroth bpkroth added the WIP Work in progress - do not merge yet label May 2, 2025
bpkroth added a commit that referenced this pull request May 12, 2025
# Pull Request

## Title

Delay entering `TrialRunner` context until `run_trial`.

______________________________________________________________________

## Description

This is part of an attempt to try and see if can work around issues with
`multiprocessing.Pool` needing to pickle certain objects when forking.

For instance, if the Environment is using an SshServer, we need to start
an EventLoopContext in the background to handle the SSH connections and
threads are not picklable.

Nor are file handles, DB connections, etc., so there may be other things
we also need to adjust to make this work.

See Also #967

______________________________________________________________________

## Type of Change


- 🛠️ Bug fix
- 🔄 Refactor

______________________________________________________________________

## Testing

- Light so far (still in draft mode)
- Just basic existing CI tests (seems to not break anything)

______________________________________________________________________

## Additional Notes (optional)

I think this is incomplete. To support forking inside the Scheduler and
*then* entering the context of the given TrialRunner, we may also need
to do something about the Scheduler's Storage object.

That was true, those PRs are now forthcoming.  See Also #971 

For now this is a draft PR to allow @jsfreischuetz and I to play with
alternative organizations of #967.

______________________________________________________________________
motus pushed a commit that referenced this pull request May 12, 2025
# Pull Request

## Title

Rename and reorganize some Scheduler methods

______________________________________________________________________

## Description

In preparation for ParallelScheduler PR (#971) this PR

1. Renames a few methods for clarity (e.g., `schedule_trial` -->
`add_trial_to_queue`) so that the role of other methods (e.g.,
`assign_trial_runners`) is more clear and readily overridable.
2. Moves some methods to the base class.
This is done based on an observation that some of the methods (e.g., the
main `start` loop) are largely reusable across both SyncScheduler and
ParallelScheduler if the method separation is done a little bit more
fine grained such that each can override the only the parts they need
to.

Additionall, it also

1. Also adds a subtle tweak on the `pending_trials` method in order to
filter based on whether or not the trial already has a runner assigned
or not.
2. Adjusts `TrialRunner.run_trial` to return the results of the Trial.
This isn't really used anywhere, but it is helpful to check that
TrialRunners run in a child process finished successfully by using that
result as a check in the return value, even though for Optimizer
bulk_registering it actually pulls the results back off of Storage in
the MainProcess.

______________________________________________________________________

## Type of Change

- 🔄 Refactor
- 🧪 Tests

______________________________________________________________________

## Testing

- [x] New CI tests for tweaks in `pending_trials`
- Existing CI tests

______________________________________________________________________

## Additional Notes (optional)


______________________________________________________________________

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
motus added a commit that referenced this pull request May 20, 2025
…es. (#979)

# Pull Request

## Title

Refactor Scheduler schema definitions to make it easier to add new ones.

______________________________________________________________________

## Description

Simple refactor of the Scheduler schemas to allow making new ones as a
copy/edit of the `synchscheduler-subschema.json`.

______________________________________________________________________

## Type of Change

- 🔄 Refactor

______________________________________________________________________

## Testing

- Exisiting CI checks for good and bad scheduler config files still
pass.
- Adding additional tests to actually load those configs with
`mlos_bench` next.

______________________________________________________________________

## Additional Notes (optional)

- Splitting some new testing infra out from #973.
- Adding `MockScheduler` next, then `ParallelScheduler` in #971.

______________________________________________________________________

---------

Co-authored-by: Sergiy Matusevych <sergiym@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP Work in progress - do not merge yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parallel trial execution
2 participants