
Snooze state display in UI #1578

Merged: 15 commits merged into LiberTEM:master on Feb 26, 2024
Conversation

@matbryan52 (Member) commented Jan 26, 2024

Had a stab at implementing the cluster snooze state display to respond to #1575.

[screenshot: cluster snooze state shown in the UI]

It is functional, but I actually don't know how best to implement the Python side (ironically), as I think this is a unique case where the Python side autonomously triggers a change in the web interface rather than handling a request through the API. In particular, the ExecutorState, which does the snoozing but has no knowledge of the SharedState or EventRegistry, cannot send messages to the web interface. This can be hacked in, but it creates some circular references and weakens the separation of concerns.

Rather than just sit on the prototype, I thought I would put it up for comments and/or advice!

Right now this only implements the transition to "snoozed" and back to "connected". It does not yet implement the "unsnoozing" in-progress state.

Fixes #1575

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed
  • No import of GPL code from MIT code

@matbryan52 changed the title from "WIP: Snooze state display in" to "WIP: Snooze state display in UI" on Jan 26, 2024

codecov bot commented Jan 26, 2024

Codecov Report

Attention: Patch coverage is 72.72727%, with 63 lines in your changes missing coverage. Please review.

Project coverage is 66.01%. Comparing base (df6f957) to head (4c6cca0).
Report is 19 commits behind head on master.

Files Patch % Lines
client/src/clusterStatus/components/Cluster.tsx 54.54% 15 Missing ⚠️
client/src/channel/sagas.ts 0.00% 12 Missing ⚠️
client/src/cluster/reducers.ts 42.85% 8 Missing ⚠️
client/src/channel/messages.ts 0.00% 6 Missing ⚠️
client/src/channel/reducers.ts 45.45% 6 Missing ⚠️
client/src/channel/components/ChannelStatus.tsx 50.00% 5 Missing ⚠️
src/libertem/web/analysis.py 55.55% 4 Missing ⚠️
src/libertem/web/engine.py 80.00% 2 Missing ⚠️
src/libertem/web/event_bus.py 92.59% 2 Missing ⚠️
src/libertem/executor/dask.py 75.00% 1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1578      +/-   ##
==========================================
- Coverage   70.71%   66.01%   -4.70%     
==========================================
  Files         324      325       +1     
  Lines       26876    27027     +151     
  Branches     3072     3081       +9     
==========================================
- Hits        19005    17842    -1163     
- Misses       7303     8694    +1391     
+ Partials      568      491      -77     


@sk1p (Member) left a comment

Thank you for getting this started. The TS changes are pretty much what I would have written 👍

I think this is a unique case where the Python side autonomously triggers a change in the web interface rather than handling a request through the API.

Indeed - a similar case is the events emitted from running jobs, and their corresponding progress messages. But it's true that that case is initiated by a request, and thus comes through a handler that has access to the whole server-side "state tree".

In particular, the ExecutorState, which does the snoozing but has no knowledge of the SharedState or EventRegistry, cannot send messages to the web interface. This can be hacked in, but it creates some circular references and weakens the separation of concerns.

So, the cycle would be: ExecutorState refs EventRegistry refs ResultEventHandlers refs (SharedState refs ExecutorState) + EventRegistry. The EventRegistry <-> ResultEventHandler cycle is already there, but IMHO okay, as these classes are closely related and have high cohesion.

The "concern weakening" would be that with the change, the EventRegistry is no longer only bound to all the HTTP handlers, but also becomes known to internal components (like ExecutorState), which previously only kept state around. Is this your line of thinking?

Right now, the EventRegistry is quite simple - it's essentially a broadcasting event bus, delivering the messages to all of the websocket connections.

We could think about adding a secondary event bus, which is replicated to the websocket connections via the EventRegistry, but lives on a higher level of the whole object tree - it could be injected into both the EventRegistry and the ExecutorState:

event_bus = EventBus()  # not attached to the name, feel free to bikeshed
event_registry = EventRegistry(event_bus=event_bus)
executor_state = ExecutorState(..., event_bus=event_bus)

def snooze(self):
    # ...
    self.event_bus.send(some_msg)

Then, there would need to be a kind of MessagePump from the internal event bus to the websocket connections for forwarding the messages. I don't know if this event bus would even need to be bi-directional - probably a unidirectional forwarding to the websockets would be enough for now (pattern: multiple producers, single consumer).

This event bus can then be injected into all classes that somehow generate events, even those that don't necessarily have access to the SharedState. (Aside: as I just built a similar thing in another context: in Python it might make sense to use a queue.Queue internally, as it is thread-safe but still can be used in an async context via loop.run_in_executor(...). Specifically, asyncio.Queue is not thread-safe, and can thus also not bridge different async loops (unless you lug around both the queue and the associated loop, and use loop.call_soon_threadsafe, ugh...))

Another topic is that the Message class needs to know about the whole SharedState right now - I think that doesn't need to be the case. Message.state is not used very widely (start_job / finish_job), and Message is mostly meant to keep the message serialization close together. If you like, I can take care of this aspect, as that's just a churny refactoring job, not really related to this PR.

With these changes, the EventRegistry itself doesn't change and isn't coupled to the ExecutorState, there's just an additional component attached to the EventRegistry that pumps asynchronous events from wherever in the system to the websocket connections. There is no additional cycle added, as EventRegistry and ExecutorState are only indirectly connected via the added event bus.

(Is that correct? Something like: ExecutorState refs EventBus, MessagePump refs EventBus, MessagePump refs EventRegistry?)

I hope this makes some kind of sense - I have to think a bit about whether the added complexity is worth it, or whether there is maybe an easier solution.

Review thread on client/src/clusterStatus/components/Cluster.tsx (outdated, resolved)
@matbryan52 (Member, Author)

Thank you for getting this started. The TS changes are pretty much what I would have written 👍

Thanks, and thanks for taking a look so quickly.

The "concern weakening" would be that with the change, the EventRegistry is no longer only bound to all the HTTP handlers, but also becomes known to internal components (like ExecutorState), which previously only kept state around. Is this your line of thinking?

That's it, yes. For me, there was always a clear distinction between state and API handlers in the server-side code, and this doesn't quite fit into that!

We could think about adding a secondary event bus, which is replicated to the websocket connections via the EventRegistry, but lives on a higher level of the whole object tree - it could be injected into both the EventRegistry and the ExecutorState:
...
Another topic is that the Message class needs to know about the whole SharedState right now - I think that doesn't need to be the case.

This seems like a good solution: a direct channel for broadcasting messages to the client without needing to send the whole application state each time. As for implementing it, I could likely do it given enough time, but if you already have a good idea of the layout or want it done quickly, then go ahead - push onto this PR, or I can rebase this one, as you prefer.

@sk1p (Member) commented Feb 4, 2024

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p (Member) commented Feb 4, 2024

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p added this to the 0.14 milestone on Feb 5, 2024
@matbryan52 (Member, Author)

I've added the "unsnoozing" state between "snoozed" and "ready".

However, it doesn't fully work right now, I think, because executor creation blocks the event loop, meaning the unsnooze message does not arrive until after the executor is back up again (again #1577).

I've also noticed that opening the cluster info modal causes the cluster to unsnooze (which could take a while). It's an edge case but I will look into how hard it will be to avoid this.

@sk1p (Member) commented Feb 8, 2024

I've added the "unsnoozing" state between "snoozed" and "ready".

Great!

However, it doesn't fully work right now, I think, because executor creation blocks the event loop, meaning the unsnooze message does not arrive until after the executor is back up again (again #1577).

Yup, it's basically this issue we need to tackle in a different way:

async def create_and_set_executor():
    if executor_spec is not None:
        # Executor creation is blocking (and slow) but we
        # need to run this within the main loop so that Distributed
        # attaches to that rather than trying to create its own, see:
        # https://github.com/LiberTEM/LiberTEM/pull/1535#pullrequestreview-1699340445
        # TL;DR: without this, our call to `loop.run_forever` causes
        # `RuntimeError: This event loop is already running`
        shared_state.create_and_set_executor(executor_spec)

I'll try to find some time to look at this later today.

I've also noticed that opening the cluster info modal causes the cluster to unsnooze (which could take a while). It's an edge case but I will look into how hard it will be to avoid this.

Probably, the "cluster info" would need to be cached, such that it is available without a running executor.

@sk1p self-assigned this on Feb 21, 2024
@matbryan52 changed the title from "WIP: Snooze state display in UI" to "Snooze state display in UI" on Feb 21, 2024
`snooze`/`unsnooze`/`make_executor`/`get_executor`/`get_context` and
`create_and_set_executor` are now all async, with some blocking
operations running in the background pool.

Ref LiberTEM#1577
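To illustrate the pattern the commit message above describes, a minimal, self-contained sketch (placeholder message types and a stand-in for the real executor creation; it glosses over the "Distributed needs the main loop" caveat quoted earlier):

```python
import asyncio
import time


def create_executor_blocking():
    # Stand-in for the slow, blocking executor creation.
    time.sleep(2)
    return "executor"


async def unsnooze(send_message):
    # This message can now reach the websocket immediately, because the
    # blocking work below no longer runs on the event loop itself:
    send_message({"messageType": "UNSNOOZING"})  # placeholder message type
    loop = asyncio.get_running_loop()
    # Run the slow work in the default thread pool, keeping the loop free:
    executor = await loop.run_in_executor(None, create_executor_blocking)
    send_message({"messageType": "READY"})  # placeholder message type
    return executor


if __name__ == "__main__":
    asyncio.run(unsnooze(print))
```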
@sk1p (Member) commented Feb 26, 2024

However, it doesn't fully work right now, I think, because executor creation blocks the event loop, meaning the unsnooze message does not arrive until after the executor is back up again (again #1577).

This should now be fixed by making the necessary methods async. What I see now on the websocket:

[screenshot of the websocket messages]

I think this did the job, let's see if this passes CI cleanly (older Python/distributed versions might disagree with me...)

@sk1p (Member) commented Feb 26, 2024

CI failure seems unrelated to the changes, caused by distributed==2024.2.1 (works fine with distributed==2024.2.0)

@sk1p (Member) commented Feb 26, 2024

CI failure seems unrelated to the changes, caused by distributed==2024.2.1 (works fine with distributed==2024.2.0)

Tracked down somehow to dask/dask#10883

@sk1p (Member) commented Feb 26, 2024

CI failure seems unrelated to the changes, caused by distributed==2024.2.1 (works fine with distributed==2024.2.0)

Tracked down somehow to dask/dask#10883

... and also, if I `log.debug("%r", params_fut)` in `_get_future`, the issue goes away (wtf?). If I just manually call `repr(params_fut)` in there, the bug is still triggered.

This was reproduced just by running
`tests/executor/test_dask.py::test_fd_limit`, which just runs a `UDF` in
a loop. As this is triggered by the consistent hashing changes in
`dask`, I have the suspicion that the key for the parameters for
different iterations is the same, and the state somehow gets mixed
up (maybe a race condition where one future gets garbage collected, data
is deleted from the workers, and this overlaps badly with the next
`scatter` operation?)
1) Use the `SpecCluster` as context manager, otherwise the asyncio loop,
   which is running in a background thread, will not be properly stopped

2) Let the `local_cluster_url` fixture depend on the `event_loop`, such
   that the cleanup ordering is correct. Only do this in case of Python
   3.7, as in newer versions, the scope of the `event_loop` fixture has
   changed.
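Roughly the shape described in that commit message (a sketch only: the `event_loop` fixture comes from pytest-asyncio, and the worker/scheduler options are assumptions, not the actual LiberTEM test setup):

```python
import pytest
from distributed import Scheduler, SpecCluster, Worker


@pytest.fixture
def local_cluster_url(event_loop):
    # Depending on `event_loop` makes pytest tear this fixture down
    # before the loop, which fixes the cleanup ordering on Python 3.7.
    workers = {0: {"cls": Worker, "options": {"nthreads": 1}}}
    scheduler = {"cls": Scheduler, "options": {}}
    # Using SpecCluster as a context manager ensures the asyncio loop it
    # runs in a background thread is stopped properly on teardown:
    with SpecCluster(workers=workers, scheduler=scheduler) as cluster:
        yield cluster.scheduler_address
```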
@sk1p (Member) commented Feb 26, 2024

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

If we directly modify the request inside of our fixture, it won't take
effect immediately; we need the layer of indirection.
@sk1p (Member) commented Feb 26, 2024

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p (Member) commented Feb 26, 2024

Sorry for the noise; the event loop issues on Python 3.7 should be ironed out now - and another workaround for dask has been cherry-picked into #1604, which has already been merged. Possibly these issues need to be revisited for Python 3.8 in #1603.

@matbryan52 feel free to take over again!

@matbryan52 marked this pull request as ready for review on February 26, 2024 20:14
@matbryan52 (Member, Author)

Sorry for the noise; the event loop issues on Python 3.7 should be ironed out now - and another workaround for dask has been cherry-picked into #1604, which has already been merged. Possibly these issues need to be revisited for Python 3.8 in #1603.

@matbryan52 feel free to take over again!

No worries, and thanks for reviewing / adding the async methods.

For me this is good to go; I've included a rebuilt client now.

@sk1p merged commit 2d39e78 into LiberTEM:master on Feb 26, 2024
23 of 27 checks passed