
Snooze state display in UI #1578

Merged: 15 commits merged into LiberTEM:master on Feb 26, 2024
Conversation

@matbryan52 (Member) commented Jan 26, 2024

Had a stab at implementing the cluster snooze state display to respond to #1575.

[screenshot: cluster snooze state shown in the UI]

It is functional, but I actually don't know how best to implement the Python side (ironically), as I think this is a unique case where the Python side autonomously triggers a change in the web interface rather than handling a request through the API. In particular, the ExecutorState, which does the snoozing but has no knowledge of the SharedState or EventRegistry, cannot send messages to the web interface. This can be hacked in, but it creates some circular references and weakens the separation of concerns.

Rather than just sit on the prototype, I thought I would put it up for comments and/or advice!

Right now this only implements the transition to "snoozed" and back to "connected". It does not yet implement the "unsnoozing" in-progress state.

Fixes #1575

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed
  • No import of GPL code from MIT code

@matbryan52 changed the title from "WIP: Snooze state display in" to "WIP: Snooze state display in UI" on Jan 26, 2024

codecov bot commented Jan 26, 2024

Codecov Report

Attention: Patch coverage is 72.72727%, with 63 lines in your changes missing coverage. Please review.

Project coverage is 66.01%. Comparing base (df6f957) to head (4c6cca0).
Report is 19 commits behind head on master.

Files Patch % Lines
client/src/clusterStatus/components/Cluster.tsx 54.54% 15 Missing ⚠️
client/src/channel/sagas.ts 0.00% 12 Missing ⚠️
client/src/cluster/reducers.ts 42.85% 8 Missing ⚠️
client/src/channel/messages.ts 0.00% 6 Missing ⚠️
client/src/channel/reducers.ts 45.45% 6 Missing ⚠️
client/src/channel/components/ChannelStatus.tsx 50.00% 5 Missing ⚠️
src/libertem/web/analysis.py 55.55% 4 Missing ⚠️
src/libertem/web/engine.py 80.00% 2 Missing ⚠️
src/libertem/web/event_bus.py 92.59% 2 Missing ⚠️
src/libertem/executor/dask.py 75.00% 1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1578      +/-   ##
==========================================
- Coverage   70.71%   66.01%   -4.70%     
==========================================
  Files         324      325       +1     
  Lines       26876    27027     +151     
  Branches     3072     3081       +9     
==========================================
- Hits        19005    17842    -1163     
- Misses       7303     8694    +1391     
+ Partials      568      491      -77     


@sk1p (Member) left a comment

Thank you for getting this started. The TS changes are pretty much what I would have written 👍

I think this is a unique case where the Python side autonomously triggers a change in the web interface rather than handling a request through the API.

Indeed - a similar case is the events emitted from running jobs, and their corresponding progress messages. But it's true that that case is initiated by a request, and thus comes through a handler that has access to the whole server-side "state tree".

In particular, the ExecutorState, which does the snoozing but has no knowledge of the SharedState or EventRegistry, cannot send messages to the web interface. This can be hacked in, but it creates some circular references and weakens the separation of concerns.

So, the cycle would be: ExecutorState refs EventRegistry refs ResultEventHandlers refs (SharedState refs ExecutorState) + EventRegistry. The EventRegistry <-> ResultEventHandler cycle is already there, but IMHO okay, as these classes are closely related and have high cohesion.

The "concern weakening" would be that with the change, the EventRegistry is no longer only bound to all the HTTP handlers, but also becomes known to internal components (like ExecutorState), which previously only kept state around. Is this your line of thinking?

Right now, the EventRegistry is quite simple - it's essentially a broadcasting event bus, delivering the messages to all of the websocket connections.

We could think about adding a secondary event bus, which is replicated to the websocket connections via the EventRegistry, but lives on a higher level of the whole object tree - it could be injected into both the EventRegistry and the ExecutorState:

event_bus = EventBus()  # not attached to the name, feel free to bikeshed
event_registry = EventRegistry(event_bus=event_bus)
executor_state = ExecutorState(..., event_bus=event_bus)

def snooze(self):
    # ...
    self.event_bus.send(some_msg)

Then, there would need to be a kind of MessagePump from the internal event bus to the websocket connections for forwarding the messages. I don't know if this event bus would even need to be bi-directional - probably a unidirectional forwarding to the websockets would be enough for now (pattern: multiple producers, single consumer).

This event bus can then be injected into all classes that somehow generate events, even those that don't necessarily have access to the SharedState. (Aside: as I just built a similar thing in another context: in Python it might make sense to use a queue.Queue internally, as it is thread-safe but still can be used in an async context via loop.run_in_executor(...). Specifically, asyncio.Queue is not thread-safe, and can thus also not bridge different async loops (unless you lug around both the queue and the associated loop, and use loop.call_soon_threadsafe, ugh...))

Another topic is that the Message class needs to know about the whole SharedState right now - I think that doesn't need to be the case. Message.state is not used very widely (start_job / finish_job), and Message is mostly meant to keep the message serialization close together. If you like, I can take care of this aspect, as that's just a churny refactoring job, not really related to this PR.

With these changes, the EventRegistry itself doesn't change and isn't coupled to the ExecutorState, there's just an additional component attached to the EventRegistry that pumps asynchronous events from wherever in the system to the websocket connections. There is no additional cycle added, as EventRegistry and ExecutorState are only indirectly connected via the added event bus.

(Is that correct? Something like: ExecutorState refs EventBus, MessagePump refs EventBus, MessagePump refs EventRegistry?)

I hope this makes some kind of sense - I have to think a bit about whether the added complexity is worth it, or whether there is maybe an easier solution.

Review thread on client/src/clusterStatus/components/Cluster.tsx (outdated, resolved)
@matbryan52 (Member, Author)

Thank you for getting this started. The TS changes are pretty much what I would have written 👍

Thanks, and thanks for taking a look so quickly.

The "concern weakening" would be that with the change, the EventRegistry is no longer only bound to all the HTTP handlers, but also becomes known to internal components (like ExecutorState), which previously only kept state around. Is this your line of thinking?

That's it, yes. For me, there was always a clear distinction between state and API handlers in the server-side code, and this doesn't quite fit into that!

We could think about adding a secondary event bus, which is replicated to the websocket connections via the EventRegistry, but lives on a higher level of the whole object tree - it could be injected into both the EventRegistry and the ExecutorState:
...
Another topic is that the Message class needs to know about the whole SharedState right now - I think that doesn't need to be the case.

This seems like a good solution: a direct channel for broadcasting messages to the client without needing to send the whole application state each time. As for implementing it, I could likely do it given enough time, but if you already have a good idea of the layout or want it done quickly, then go ahead - push onto this PR, or I can rebase this one, as you prefer.

@sk1p (Member) commented Feb 4, 2024

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p (Member) commented Feb 4, 2024

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p added this to the 0.14 milestone on Feb 5, 2024
@matbryan52 (Member, Author)

I've added the "unsnoozing" state between "snoozed" and "ready".

However, it doesn't fully work right now, I think, because executor creation blocks the event loop, meaning the unsnooze message does not arrive until after the executor is back up again (again #1577).

I've also noticed that opening the cluster info modal causes the cluster to unsnooze (which could take a while). It's an edge case but I will look into how hard it will be to avoid this.

@sk1p (Member) commented Feb 8, 2024

I've added the "unsnoozing" state between "snoozed" and "ready".

Great!

However, it doesn't fully work right now, I think, because executor creation blocks the event loop, meaning the unsnooze message does not arrive until after the executor is back up again (again #1577).

Yup, it's basically this issue we need to tackle in a different way:

async def create_and_set_executor():
    if executor_spec is not None:
        # Executor creation is blocking (and slow) but we
        # need to run this within the main loop so that Distributed
        # attaches to that rather than trying to create its own, see:
        # https://github.com/LiberTEM/LiberTEM/pull/1535#pullrequestreview-1699340445
        # TL;DR: without this, our call to `loop.run_forever` causes
        # `RuntimeError: This event loop is already running`
        shared_state.create_and_set_executor(executor_spec)

I'll try to find some time to look at this later today.

I've also noticed that opening the cluster info modal causes the cluster to unsnooze (which could take a while). It's an edge case but I will look into how hard it will be to avoid this.

Probably, the "cluster info" would need to be cached, such that it is available without a running executor.

@sk1p self-assigned this on Feb 21, 2024
@matbryan52 changed the title from "WIP: Snooze state display in UI" to "Snooze state display in UI" on Feb 21, 2024
`snooze`/`unsnooze`/`make_executor`/`get_executor`/`get_context` and
`create_and_set_executor` are now all async, with some blocking
operations running in the background pool.

Ref LiberTEM#1577
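To illustrate the pattern the commit message above describes, a minimal, self-contained sketch (placeholder message types and a stand-in for the real executor creation; it glosses over the "Distributed needs the main loop" caveat quoted earlier):

```python
import asyncio
import time


def create_executor_blocking():
    # Stand-in for the slow, blocking executor creation.
    time.sleep(2)
    return "executor"


async def unsnooze(send_message):
    # This message can now reach the websocket immediately, because the
    # blocking work below no longer runs on the event loop itself:
    send_message({"messageType": "UNSNOOZING"})  # placeholder message type
    loop = asyncio.get_running_loop()
    # Run the slow work in the default thread pool, keeping the loop free:
    executor = await loop.run_in_executor(None, create_executor_blocking)
    send_message({"messageType": "READY"})  # placeholder message type
    return executor


if __name__ == "__main__":
    asyncio.run(unsnooze(print))
```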
@sk1p (Member) commented Feb 26, 2024

However, it doesn't fully work right now, I think, because executor creation blocks the event loop, meaning the unsnooze message does not arrive until after the executor is back up again (again #1577).

This should now be fixed by making the necessary methods async. What I see now on the websocket:

[screenshot of the websocket messages]

I think this did the job, let's see if this passes CI cleanly (older Python/distributed versions might disagree with me...)

@sk1p (Member) commented Feb 26, 2024

CI failure seems unrelated to the changes, caused by distributed==2024.2.1 (works fine with distributed==2024.2.0)

@sk1p (Member) commented Feb 26, 2024

CI failure seems unrelated to the changes, caused by distributed==2024.2.1 (works fine with distributed==2024.2.0)

Tracked down somehow to dask/dask#10883

@sk1p (Member) commented Feb 26, 2024

CI failure seems unrelated to the changes, caused by distributed==2024.2.1 (works fine with distributed==2024.2.0)

Tracked down somehow to dask/dask#10883

... and also, if I `log.debug("%r", params_fut)` in `_get_future`, the issue goes away (wtf?). If I just manually call `repr(params_fut)` in there, the bug is still triggered.

This was reproduced just by running
`tests/executor/test_dask.py::test_fd_limit`, which just runs a `UDF` in
a loop. As this is triggered by the consistent hashing changes in
`dask`, I have the suspicion that the key for the parameters for
different iterations is the same, and the state somehow gets mixed
up (maybe a race condition where one future gets garbage collected, data
is deleted from the workers, and this overlaps badly with the next
`scatter` operation?)
1) Use the `SpecCluster` as context manager, otherwise the asyncio loop,
   which is running in a background thread, will not be properly stopped

2) Let the `local_cluster_url` fixture depend on the `event_loop`, such
   that the cleanup ordering is correct. Only do this in case of Python
   3.7, as in newer versions, the scope of the `event_loop` fixture has
   changed.
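Roughly the shape described in that commit message (a sketch only: the `event_loop` fixture comes from pytest-asyncio, and the worker/scheduler options are assumptions, not the actual LiberTEM test setup):

```python
import pytest
from distributed import Scheduler, SpecCluster, Worker


@pytest.fixture
def local_cluster_url(event_loop):
    # Depending on `event_loop` makes pytest tear this fixture down
    # before the loop, which fixes the cleanup ordering on Python 3.7.
    workers = {0: {"cls": Worker, "options": {"nthreads": 1}}}
    scheduler = {"cls": Scheduler, "options": {}}
    # Using SpecCluster as a context manager ensures the asyncio loop it
    # runs in a background thread is stopped properly on teardown:
    with SpecCluster(workers=workers, scheduler=scheduler) as cluster:
        yield cluster.scheduler_address
```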
@sk1p (Member) commented Feb 26, 2024

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

If we directly modify the request inside of our fixture, it won't take
effect immediately; we need the layer of indirection.
@sk1p (Member) commented Feb 26, 2024

/azp run libertem.libertem-data

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p (Member) commented Feb 26, 2024

Sorry for the noise; the event loop issues on Python 3.7 should be ironed out now - and another workaround for dask has been cherry-picked into #1604, which has already been merged. Possibly these issues need to be revisited for Python 3.8 in #1603.

@matbryan52 feel free to take over again!

@matbryan52 marked this pull request as ready for review on February 26, 2024 20:14
@matbryan52 (Member, Author)

Sorry for the noise; the event loop issues on Python 3.7 should be ironed out now - and another workaround for dask has been cherry-picked into #1604, which has already been merged. Possibly these issues need to be revisited for Python 3.8 in #1603.

@matbryan52 feel free to take over again!

No worries, and thanks for reviewing / adding the async methods.

For me this is good to go; I've included a rebuilt client now.

@sk1p merged commit 2d39e78 into LiberTEM:master on Feb 26, 2024
23 of 27 checks passed