Scalability optimizations in the RP stack present race conditions that make it hard to determine whether a submitted Task will ever change state, whether a callback will ever be called, or whether a component has actually started shutting down in the time between successfully enqueuing a command and the responsible thread processing the command.
We can add some facilities to scalems.radical.session.RuntimeSession to consolidate these checks with a minimal number of callbacks and extra tasks.
RP callbacks can set threading.Event attributes directly, and/or call loop.call_soon_threadsafe(event.set) for asyncio.Event attributes, since RP invokes callbacks from its own threads rather than from the event loop thread.
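A minimal sketch of the two ways a callback might signal state, assuming the callback runs on a thread other than the one running the event loop. The `rp_style_callback` function is a hypothetical stand-in for a registered RP callback; only the standard library is used here.

```python
import asyncio
import threading


async def main() -> bool:
    loop = asyncio.get_running_loop()
    thread_flag = threading.Event()  # safe to set from any thread
    async_flag = asyncio.Event()     # should only be set from the loop's thread

    def rp_style_callback() -> None:
        # Hypothetical stand-in for a callback invoked on an RP-owned thread.
        thread_flag.set()
        # Hand the asyncio.Event update over to the event loop thread.
        loop.call_soon_threadsafe(async_flag.set)

    worker = threading.Thread(target=rp_style_callback)
    worker.start()
    # Bound the wait so a lost callback cannot hang the caller forever.
    await asyncio.wait_for(async_flag.wait(), timeout=5)
    worker.join()
    return thread_flag.is_set() and async_flag.is_set()


result = asyncio.run(main())
```

Calling `async_flag.set()` directly from the worker thread would not be thread-safe, which is why the sketch routes it through `call_soon_threadsafe`.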
Proposed Event attributes
session_closing
session_closed
pilot_available
pilot_done
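One way to group the four proposed attributes is a small container. This is only an illustrative sketch: the `RuntimeEvents` name and the `snapshot` helper are hypothetical, and the choice of asyncio.Event over threading.Event is an assumption.

```python
import asyncio
import dataclasses


@dataclasses.dataclass(frozen=True)
class RuntimeEvents:
    """Hypothetical container; create it while the event loop is running."""

    session_closing: asyncio.Event = dataclasses.field(default_factory=asyncio.Event)
    session_closed: asyncio.Event = dataclasses.field(default_factory=asyncio.Event)
    pilot_available: asyncio.Event = dataclasses.field(default_factory=asyncio.Event)
    pilot_done: asyncio.Event = dataclasses.field(default_factory=asyncio.Event)

    def snapshot(self) -> dict:
        # Report which events are currently set, keyed by attribute name.
        return {
            field.name: getattr(self, field.name).is_set()
            for field in dataclasses.fields(self)
        }
```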
The RuntimeSession can register some Pilot callbacks and own some asyncio Tasks to maintain the state.
Periodically (asyncio.sleep of at least 1 second) check Session.closed, in case the Session is ended by something external, and set session_closed and pilot_done. Cancel this Task when closing normally.
Wait for session_closed and check that session_closing and pilot_done get set.
Use a Pilot callback to set pilot_available. Run an asyncio.Task to unregister the callback when either pilot_available or pilot_done is set. Cancel the task when session_closed is set.
Use a Pilot callback to set pilot_done when Pilot completes, fails, or is canceled.
Create an asyncio.Task to wait for the first of session_closing, session_closed, pilot_available, or pilot_done, or for asyncio.sleep(10). If the sleep finishes first, check the Pilot state, in case our callback was registered too late to catch the state transition of interest, and set pilot_available or pilot_done as appropriate. Otherwise, assume the callbacks are working and return.
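Two of the tasks above (the periodic Session.closed check and the late-registration check) could be sketched roughly as follows. The function names are hypothetical, the `events` argument is assumed to be a dict of asyncio.Event objects keyed by the proposed attribute names, and the state strings are assumed spellings of the RP pilot states rather than values imported from RP.

```python
import asyncio

# Assumed spellings of the RP pilot states of interest (stand-ins, not imported).
ACTIVE_STATE = "PMGR_ACTIVE"
FINAL_STATES = ("DONE", "FAILED", "CANCELED")


async def watch_session_closed(session, events: dict, interval: float = 1.0) -> None:
    """Poll session.closed; mark session and pilot finished once it is True."""
    while not session.closed:
        await asyncio.sleep(interval)
    events["session_closed"].set()
    events["pilot_done"].set()


async def confirm_pilot_events(pilot, events: dict, timeout: float = 10.0) -> bool:
    """Return True if an event fired within the timeout, False if we polled."""
    waiters = [asyncio.ensure_future(event.wait()) for event in events.values()]
    try:
        done, _ = await asyncio.wait(
            waiters, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
        )
    finally:
        # Do not leave stray waiter tasks behind, whatever the outcome.
        for waiter in waiters:
            waiter.cancel()
    if done:
        return True
    # Timed out: the callback may have been registered after the transition,
    # so fall back to reading the pilot state directly.
    if pilot.state in FINAL_STATES:
        events["pilot_done"].set()
    elif pilot.state == ACTIVE_STATE:
        events["pilot_available"].set()
    return False
```

The owner would cancel the `watch_session_closed` task during a normal close, as described in the list above.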
We may also want to update the handling of the pilot resources Future. The Task responsible should be canceled if not resolved before pilot_done.
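A possible shape for that guard, as a sketch only: `cancel_unless_resolved` is a hypothetical helper, and `resources_task` stands in for whatever asyncio.Task currently resolves the pilot resources Future.

```python
import asyncio


async def cancel_unless_resolved(
    resources_task: asyncio.Task, pilot_done: asyncio.Event
) -> None:
    """Cancel resources_task if pilot_done is set before the task finishes."""
    done_waiter = asyncio.ensure_future(pilot_done.wait())
    try:
        await asyncio.wait(
            {resources_task, done_waiter}, return_when=asyncio.FIRST_COMPLETED
        )
    finally:
        done_waiter.cancel()
    if not resources_task.done():
        # The pilot finished first, so the resources can never be resolved.
        resources_task.cancel()
```

Consumers awaiting the resources task would then see asyncio.CancelledError instead of waiting indefinitely.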
We can separate the pilot() acquisition method once these events are available. RuntimeSession will just have a pilot attribute that is None until the Pilot is successfully submitted (if at all). Clients will have to check for non-null value, since pilot_done needs to be set in case of failure.
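From the client side, the check might look something like the sketch below. `SessionSketch` is a hypothetical stand-in for RuntimeSession that carries only the attributes discussed here, and `usable_pilot` is an illustrative helper, not an existing scalems function.

```python
import asyncio
from typing import Optional


class SessionSketch:
    """Hypothetical stand-in for RuntimeSession; not the scalems class."""

    def __init__(self) -> None:
        self.pilot: Optional[object] = None  # None until a Pilot is submitted
        self.pilot_available = asyncio.Event()
        self.pilot_done = asyncio.Event()


async def usable_pilot(session: SessionSketch) -> Optional[object]:
    """Wait for pilot_available or pilot_done; return the pilot only if usable."""
    waiters = [
        asyncio.ensure_future(event.wait())
        for event in (session.pilot_available, session.pilot_done)
    ]
    try:
        await asyncio.wait(waiters, return_when=asyncio.FIRST_COMPLETED)
    finally:
        for waiter in waiters:
            waiter.cancel()
    # pilot_done is also set on failure, so the attribute may still be None.
    if session.pilot_done.is_set() or session.pilot is None:
        return None
    return session.pilot
```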
Note that this issue will require careful testing. See also #359
Add and rearrange some program state management. Add some notes and
describe incomplete state management.
Ref SCALE-MS#378, SCALE-MS#383.
- [X] Acquire the Raptor master task through the RuntimeManager.
- [X] Manage a CPI command queue translating CPI calls to
Raptor-backed
Futures (RPTasks or RPC calls)
- [X] Make sure CPI Session is properly shut down. (Partially deferred)
- [ ] Acquire the Worker(s) through CPI call to the RuntimeManager.
- [ ] Normalize the "stop" command to shut down everything cleanly and
expeditiously. (In too many cases right now, tests take an improperly
long time because of various timeouts.)
- [ ] Isolate RPExecutor concurrent.futures.Executor support from
asyncio support (avoid blocking the event loop by avoiding event loop
usage in the main implementation)
Ref SCALE-MS#345, SCALE-MS#377