Unify tracking of RP session states. #383

Open
eirrgang opened this issue Aug 4, 2023 · 0 comments
eirrgang commented Aug 4, 2023

Scalability optimizations in the RP stack introduce race conditions. In the window between successfully enqueuing a command and the responsible thread processing it, it is hard to determine whether a submitted Task will ever change state, whether a callback will ever be called, or whether a component has actually started shutting down.

We can add some facilities to scalems.radical.session.RuntimeSession to consolidate checks with a minimal number of call-backs and extra tasks.

RP callbacks can set threading.Event attributes directly, and/or call loop.call_soon_threadsafe(event.set) for asyncio.Event attributes.
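A minimal sketch of that pattern, using only the standard library. The function `on_pilot_state_change` and the two event names are hypothetical stand-ins; a real RP callback would be registered with the Pilot and would run in a thread other than the event loop's.

```python
import asyncio
import threading


async def main() -> bool:
    loop = asyncio.get_running_loop()
    pilot_done_async = asyncio.Event()       # awaited by coroutines in the event loop
    pilot_done_threaded = threading.Event()  # safe to set from any thread

    def on_pilot_state_change() -> None:
        # Hypothetical stand-in for an RP callback running outside the loop thread.
        pilot_done_threaded.set()
        # asyncio.Event is not thread-safe, so schedule set() on the loop's own thread.
        loop.call_soon_threadsafe(pilot_done_async.set)

    worker = threading.Thread(target=on_pilot_state_change)
    worker.start()
    await asyncio.wait_for(pilot_done_async.wait(), timeout=5)
    worker.join()
    return pilot_done_async.is_set() and pilot_done_threaded.is_set()


result = asyncio.run(main())
```

Calling `pilot_done_async.set()` directly from the worker thread might appear to work, but waiters in the loop are not guaranteed to wake promptly, which is why the call is routed through `call_soon_threadsafe`.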

Proposed Event attributes

  • session_closing
  • session_closed
  • pilot_available
  • pilot_done

The RuntimeSession can register some Pilot callbacks and own some asyncio Tasks to maintain the state.

  • Periodically (asyncio.sleep at least 1 second) check Session.closed, in case the Session is ended by something external, and set session_closed and pilot_done. Cancel this Task when closing normally.
  • Wait for session_closed and check that session_closing and pilot_done get set.
  • Use a Pilot callback to set pilot_available. Run an asyncio.Task to unregister the callback when pilot_available or pilot_done is set. Cancel the task when session_closed is set.
  • Use a Pilot callback to set pilot_done when Pilot completes, fails, or is canceled.
  • Create an asyncio.Task to wait for the first of session_closing, session_closed, pilot_available, or pilot_done, or asyncio.sleep(10). If the sleep finishes first, check the Pilot state, in case our callback was registered too late to catch the state transition of interest, and set pilot_available or pilot_done if appropriate. Otherwise, assume the callbacks are good to go, and return.
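The last item above (wait for the first event or a timeout, then fall back to polling) could be sketched roughly as follows. The helper names `first_event_or_timeout` and `reconcile`, the `get_pilot_state` callable, and the state strings are assumptions for illustration; real code would compare against the state constants exported by RP.

```python
import asyncio
from typing import Callable


async def first_event_or_timeout(events: dict[str, asyncio.Event], timeout: float) -> set[str]:
    """Return the names of events set within `timeout` seconds (empty on timeout)."""
    waiters = {asyncio.create_task(ev.wait()): name for name, ev in events.items()}
    try:
        done, _ = await asyncio.wait(
            set(waiters), timeout=timeout, return_when=asyncio.FIRST_COMPLETED
        )
        return {waiters[task] for task in done}
    finally:
        # Cancelling an already finished task is a no-op.
        for task in waiters:
            task.cancel()


async def reconcile(
    events: dict[str, asyncio.Event],
    get_pilot_state: Callable[[], str],
    timeout: float = 10.0,
) -> None:
    if await first_event_or_timeout(events, timeout):
        return  # Some event fired, so assume the callbacks are working.
    # Timed out: the callback may have been registered after the transition.
    state = get_pilot_state()
    if state == "PMGR_ACTIVE":  # assumed name of the "pilot is running" state
        events["pilot_available"].set()
    elif state in ("DONE", "FAILED", "CANCELED"):  # assumed final states
        events["pilot_done"].set()


async def demo() -> tuple[bool, bool]:
    names = ("session_closing", "session_closed", "pilot_available", "pilot_done")
    events = {name: asyncio.Event() for name in names}
    # No callback ever fires here, so the polling fallback has to set the event.
    await reconcile(events, lambda: "PMGR_ACTIVE", timeout=0.05)
    return events["pilot_available"].is_set(), events["pilot_done"].is_set()


available, done = asyncio.run(demo())
```

Setting an event that is already set is harmless, so the fallback does not need to coordinate with a callback that fires at nearly the same moment.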

We may also want to update the handling of the pilot resources Future. The Task responsible should be canceled if not resolved before pilot_done.
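One possible shape for that cancellation, as a sketch only: `resolve_or_cancel` is a hypothetical helper, and the `asyncio.sleep(60)` task stands in for whatever Task resolves the pilot resources Future.

```python
import asyncio


async def resolve_or_cancel(work: asyncio.Task, pilot_done: asyncio.Event) -> bool:
    """Wait for `work`, cancelling it if `pilot_done` is set first.

    Returns True if `work` completed (with a result or an exception).
    """
    done_waiter = asyncio.create_task(pilot_done.wait())
    await asyncio.wait({work, done_waiter}, return_when=asyncio.FIRST_COMPLETED)
    done_waiter.cancel()
    if work.done():
        return not work.cancelled()
    work.cancel()
    # Let the cancellation propagate before reporting back.
    await asyncio.gather(work, return_exceptions=True)
    return False


async def demo() -> tuple[bool, bool]:
    pilot_done = asyncio.Event()
    # Hypothetical stand-in for the Task that resolves the pilot resources.
    work = asyncio.create_task(asyncio.sleep(60))
    asyncio.get_running_loop().call_later(0.01, pilot_done.set)
    finished = await resolve_or_cancel(work, pilot_done)
    return finished, work.cancelled()


finished, cancelled = asyncio.run(demo())
```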

We can separate the pilot() acquisition method once these events are available. RuntimeSession will just have a pilot attribute that is None until the Pilot is successfully submitted (if at all). Clients will have to check for a non-None value, since pilot_done also needs to be set in case of failure.

Note that this issue will require careful testing. See also #359

eirrgang self-assigned this Aug 4, 2023
eirrgang added a commit that referenced this issue Aug 9, 2023
eirrgang added a commit to eirrgang/scale-ms that referenced this issue Aug 9, 2023
Add and rearrange some program state management. Add some notes and
describe incomplete state management.

Ref SCALE-MS#378, SCALE-MS#383.

- [X] Acquire the Raptor master task through the RuntimeManager.
- [X] Manage a CPI command queue translating CPI calls to Raptor-backed
  Futures (RPTasks or RPC calls)
- [X] Make sure CPI Session is properly shut down. (Partially deferred)
- [ ] Acquire the Worker(s) through CPI call to the RuntimeManager.
- [ ] Normalize the "stop" command to shut down everything cleanly and
  expeditiously. (In too many cases right now, tests take an improperly
  long time because of various timeouts.)
- [ ] Isolate RPExecutor concurrent.futures.Executor support from
  asyncio support (avoid blocking the event loop by avoiding event loop
  usage in the main implementation)

Ref SCALE-MS#345, SCALE-MS#377