Snooze state display in UI #1578
Conversation
Codecov Report - Attention: Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #1578      +/-   ##
==========================================
- Coverage   70.71%   66.01%    -4.70%
==========================================
  Files         324      325        +1
  Lines       26876    27027      +151
  Branches     3072     3081        +9
==========================================
- Hits        19005    17842     -1163
- Misses       7303     8694     +1391
+ Partials      568      491       -77
```

☔ View full report in Codecov by Sentry.
Thank you for getting this started. The TS changes are pretty much what I would have written 👍
I think this is a unique case where the Python side autonomously triggers a change in the web interface rather than handling a request through the API.
Indeed - a similar case are the events emitted from running jobs, and their corresponding progress messages. But it's true that this is initiated by a request, and thus comes through a handler that has access to the whole server-side "state tree".
In particular, the `ExecutorState`, which does the snoozing but has no knowledge of the `SharedState` or the `EventRegistry`, cannot send messages to the web interface. It can be hacked in, but this creates some circular references and weakens the separation of concerns.
So, the cycle would be: `ExecutorState` refs `EventRegistry` refs `ResultEventHandler`s refs (`SharedState` refs `ExecutorState`) + `EventRegistry`. The `EventRegistry` <-> `ResultEventHandler` cycle is already there, but IMHO okay, as these classes are closely related and have high cohesion.

The "concern weakening" would be that with the change, the `EventRegistry` is no longer only bound to all the HTTP handlers, but also becomes known to internal components (like `ExecutorState`), which previously only kept state around. Is this your line of thinking?
Right now, the `EventRegistry` is quite simple - it's essentially a broadcasting event bus, delivering the messages to all of the websocket connections.

We could think about adding a secondary event bus, which is replicated to the websocket connections via the `EventRegistry`, but lives on a higher level of the whole object tree - it could be injected into both the `EventRegistry` and the `ExecutorState`:

```python
event_bus = EventBus()  # not attached to the name, feel free to bikeshed
event_registry = EventRegistry(event_bus=event_bus)
executor_state = ExecutorState(..., event_bus=event_bus)

# inside ExecutorState:
def snooze(self):
    # ...
    self.event_bus.send(some_msg)
```
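The `EventBus` itself isn't defined anywhere in the PR yet; a minimal sketch of what it could look like, assuming a thread-safe `queue.Queue` as the internal transport (the class name, method names, and message shape are all placeholders, not LiberTEM API):

```python
import queue

class EventBus:
    """Hypothetical internal event bus: a thin wrapper around a
    thread-safe queue.Queue, so sync code can send from any thread."""

    def __init__(self):
        self._queue = queue.Queue()

    def send(self, msg):
        # thread-safe; can be called from sync code like ExecutorState.snooze()
        self._queue.put(msg)

    def receive(self, timeout=None):
        # blocking receive; a pump could call this via loop.run_in_executor(...)
        return self._queue.get(timeout=timeout)

bus = EventBus()
bus.send({"event": "SNOOZE", "state": "snoozed"})
print(bus.receive())  # → {'event': 'SNOOZE', 'state': 'snoozed'}
```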
Then, there would need to be a kind of `MessagePump` from the internal event bus to the websocket connections for forwarding the messages. I don't know if this event bus would even need to be bi-directional - probably a unidirectional forwarding to the websockets would be enough for now (pattern: multiple producers, single consumer).
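Such a pump could be a single coroutine that drains the thread-safe bus and hands each message to the broadcasting side. A sketch, assuming a plain `queue.Queue` as the bus and a hypothetical `broadcast` callable standing in for the `EventRegistry`'s send-to-all-websockets method:

```python
import asyncio
import queue

async def message_pump(bus: queue.Queue, broadcast):
    """Hypothetical MessagePump: forward messages from the internal bus
    to the websocket broadcast, without blocking the event loop."""
    loop = asyncio.get_running_loop()
    while True:
        # the blocking get() runs in the default thread pool,
        # so the asyncio loop stays responsive
        msg = await loop.run_in_executor(None, bus.get)
        if msg is None:  # sentinel for shutdown
            break
        await broadcast(msg)

async def demo():
    bus = queue.Queue()
    received = []

    async def broadcast(msg):
        received.append(msg)

    bus.put({"event": "UNSNOOZE"})
    bus.put(None)  # shut the pump down
    await message_pump(bus, broadcast)
    return received

print(asyncio.run(demo()))  # → [{'event': 'UNSNOOZE'}]
```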
This event bus can then be injected into all classes that somehow generate events, even those that don't necessarily have access to the `SharedState`. (Aside, as I just built a similar thing in another context: in Python it might make sense to use a `queue.Queue` internally, as it is thread-safe but can still be used in an async context via `loop.run_in_executor(...)`. Specifically, `asyncio.Queue` is not thread-safe, and thus also cannot bridge different async loops (unless you lug around both the queue and the associated loop, and use `loop.call_soon_threadsafe`, ugh...))
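The thread-safety point above can be shown concretely: a `queue.Queue` bridges a plain producer thread and an async consumer, which an `asyncio.Queue` cannot do safely. A small self-contained sketch (not LiberTEM code):

```python
import asyncio
import queue
import threading

# queue.Queue is thread-safe, so a plain thread can feed an async consumer.
q = queue.Queue()

def producer():
    # runs in a background thread, no event loop needed
    for i in range(3):
        q.put(i)
    q.put(None)  # sentinel

async def consumer():
    loop = asyncio.get_running_loop()
    items = []
    while True:
        # bridge the blocking get() into the async world
        item = await loop.run_in_executor(None, q.get)
        if item is None:
            break
        items.append(item)
    return items

threading.Thread(target=producer).start()
print(asyncio.run(consumer()))  # → [0, 1, 2]
```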
Another topic is that the `Message` class needs to know about the whole `SharedState` right now - I think that doesn't need to be the case. `Message.state` is not used very widely (`start_job` / `finish_job`), and `Message` is mostly meant to keep the message serialization close together. If you like, I can take care of this aspect, as that's just a churny refactoring job, not really related to this PR.
With these changes, the `EventRegistry` itself doesn't change and isn't coupled to the `ExecutorState`; there's just an additional component attached to the `EventRegistry` that pumps asynchronous events from wherever in the system to the websocket connections. There is no additional cycle added, as `EventRegistry` and `ExecutorState` are only indirectly connected via the added event bus.
(Is that correct? Something like: `ExecutorState` refs `EventBus`, `MessagePump` refs `EventBus`, `MessagePump` refs `EventRegistry`?)
I hope this makes some kind of sense - I have to think a bit about if the added complexity is worth it, or if there is maybe an easier solution.
Thanks, and thanks for taking a look so quickly.
That's it, yes. For me, there was always a clear distinction between state and API handlers in the server-side code, and this doesn't quite fit into that!
This seems like a good solution: a direct channel for broadcasting messages to the client without needing to send the whole application state each time. As for implementing it, I could likely do it given enough time, but if you already have a good idea of the layout, or want it done quickly, then go ahead - push onto this PR, or I can rebase this one, as you prefer.
/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).
I've added the "unsnoozing" state between "snoozed" and "ready". However, it doesn't work fully right now, I think, because executor creation blocks the event loop. I've also noticed that opening the cluster info modal causes the cluster to unsnooze (which could take a while). It's an edge case, but I will look into how hard it will be to avoid this.
Great!
Yup, it's basically this issue we need to tackle in a different way: `LiberTEM/src/libertem/web/server.py`, lines 244 to 252 in `5d4d2de`.
I'll try to find some time to look at this later today.
Probably, the "cluster info" would need to be cached, such that it is available without a running executor.
Force-pushed `5d71f6e` to `027c6eb`.
`snooze`/`unsnooze`/`make_executor`/`get_executor`/`get_context` and `create_and_set_executor` are now all async, with some blocking operations running in the background pool. Ref LiberTEM#1577
This should now be fixed, by making the necessary methods async. I think this did the job; let's see if this passes CI cleanly (older Python/distributed versions might disagree with me...)
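The pattern used for making the blocking operations async can be sketched like this: the expensive call runs in the default thread pool via `loop.run_in_executor`, so the server's event loop keeps serving requests (and state updates) while, say, the executor is being created. The function names here are placeholders, not the actual LiberTEM methods:

```python
import asyncio
import time

def make_executor_blocking():
    # stand-in for the expensive, blocking executor/cluster creation
    time.sleep(0.1)
    return "executor"

async def make_executor():
    # run the blocking creation in the default thread pool, keeping
    # the asyncio event loop responsive (e.g. so the web UI can still
    # receive the "unsnoozing" state while this is in flight)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, make_executor_blocking)

print(asyncio.run(make_executor()))  # → executor
```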
CI failure seems unrelated to the changes, caused by
Tracked down somehow to dask/dask#10883
... and also, if I
This was reproduced just by running `tests/executor/test_dask.py::test_fd_limit`, which just runs a `UDF` in a loop. As this is triggered by the consistent hashing changes in `dask`, I have the suspicion that the keys for the parameters of different iterations map to the same value, and the state somehow gets mixed up (maybe a race condition where one future gets garbage collected, data is deleted from the workers, and this overlaps badly with the next `scatter` operation?).
1) Use the `SpecCluster` as a context manager, otherwise the asyncio loop, which is running in a background thread, will not be properly stopped.
2) Let the `local_cluster_url` fixture depend on the `event_loop` fixture, such that the cleanup ordering is correct. Only do this in case of Python 3.7, as in newer versions the scope of the `event_loop` fixture has changed.
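The context-manager point in 1) can be illustrated without dask: `__exit__` runs the cleanup even if the body raises, whereas a bare constructor call can leak the background loop on an error path. `FakeCluster` below is a stand-in for `SpecCluster`, not distributed's API:

```python
class FakeCluster:
    """Minimal stand-in for a cluster that owns a background asyncio
    loop; close() is where the real SpecCluster would stop that loop."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False  # don't swallow exceptions

    def close(self):
        self.closed = True

with FakeCluster() as cluster:
    pass  # test body; cleanup is guaranteed even if this raises

print(cluster.closed)  # → True
```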
/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

If we directly modify the request inside of our fixture, it won't take effect immediately, we need the layer of indirection.

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).
Sorry for the noise - the event loop issues on Python 3.7 should be ironed out now, and another workaround for dask has been cherry-picked into #1604 and has already been merged. Possibly these issues need to be revisited for Python 3.8 in #1603. @matbryan52 feel free to take over again!
No worries, and thanks for reviewing / adding the async methods. For me this is good to go; I've included a rebuilt client now.
Had a stab at implementing the cluster snooze state display to respond to #1575.

It is functional, but I actually don't know how best to implement the Python side (ironically), as I think this is a unique case where the Python side autonomously triggers a change in the web interface rather than handling a request through the API. In particular, the `ExecutorState`, which is doing the snoozing but has no knowledge of the `SharedState` or the `EventRegistry`, cannot send messages to the web interface. It can be hacked in, but this creates some circular references and weakens the separation of concerns.

Rather than just sit on the proto, I thought I would put it up for comments and/or advice!

Right now this only implements the transition to "snoozed" and back to "connected". It does not yet implement the "unsnoozing" in-progress state.
Fixes #1575
Contributor Checklist:
Reviewer Checklist:
/azp run libertem.libertem-data