Rollback "claimed" runs to "available" if worker doesn't join run channel #3566
@@ -31,6 +31,20 @@ defmodule LightningWeb.RunChannel do
            Runs.get_project_id_for_run(run) do
       Sentry.Context.set_extra_context(%{run_id: id})

+      # Notify the worker channel that this run channel has been joined
+      # Use PubSub for cross-node communication in clustered environments
+      case socket.assigns[:worker_id] do
Collaborator:
@taylordowns2000
+        nil ->
+          # No worker ID available, continue normally
+          :ok
+
+        worker_id ->
Collaborator:
@taylordowns2000 Are we vulnerable to a race condition here? E.g. we retrieve the run, broadcast a message that the timer can be reset, and then send data back to the worker. In the gap between the broadcast and something acting on it, the timeout could kick in, unclaiming the run and potentially allowing a situation where two workers are running it. This feels like the 'ack-ack' scenario (although we can use the run channel response as the 'ack-ack'), and I think that addresses your question about the Worker A / Worker B scenario. When a worker joins the run channel, I think Lightning should confirm that the run is still claimed, lock the run, and apply a change that makes rollback impossible (it feels like the responsible way to do this would be to introduce a new state), and only then send the :ok back to the worker. If these conditions do not hold, the worker gets an error.
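A minimal sketch of the confirm-on-join idea described in the comment above, assuming the run's `state` column is an `Ecto.Enum` and that the new, rollback-proof state is called `:started`; the module and function names are hypothetical and not part of this PR. An atomic conditional UPDATE succeeds only while the run is still `:claimed`, so the join confirmation and the rollback job cannot both win:

```elixir
defmodule Lightning.Runs.ClaimGuard do
  @moduledoc false
  import Ecto.Query

  alias Lightning.{Repo, Run}

  # Flip the run from :claimed to :started only if it is still :claimed.
  # Returns :ok when the transition happened, an error tuple otherwise.
  def confirm_claim_on_join(run_id) do
    {count, _} =
      from(r in Run, where: r.id == ^run_id and r.state == ^:claimed)
      |> Repo.update_all(set: [state: :started])

    if count == 1, do: :ok, else: {:error, :no_longer_claimed}
  end
end
```

The channel's `join/3` would then reply `{:ok, socket}` only when this returns `:ok`, and an error payload otherwise.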
Member (Author):
@rorymckinley Would this be handled better by ensuring that a run channel can only be joined once, or only by a single worker? (This was a thought I had in the initial PR.)
Member (Author):
Eish, or would that prevent workers from re-connecting to the run channel if they got temporarily disconnected?
Collaborator:
@taylordowns2000 Even if a single-connect is an option, I do not think it would solve the problem entirely, and it may introduce a new problem. Showing my workings, apologies for repeating what I said before: I can't see any obvious place where we check the state of a run before we send stuff back to the worker (we probably never needed it before?), so race 1 would be that the run has already been rolled back (essentially Worker A sans Worker B). A single connect does not seem to solve this. Race 2 would be that the run has not been rolled back, but in between the broadcast and the action the run gets rolled back. Using the Highlander approach you suggest would prevent another worker picking up the run once it has rolled back, but...
Collaborator:
And, oops: I forgot to add that I do not know enough about workers to know if one will reconnect mid-run to the run channel - @josephjclark?
Member (Author):
Eish. Yeah, it does sound like adding another state (
Collaborator:
I believe the worker will just reconnect when the connection is lost. The socket should handle this. I'll test locally later and see if I can confirm.
+          Lightning.broadcast(
+            "worker_channel:#{worker_id}",
+            {:run_channel_joined, id, worker_id}
+          )
+      end
+
       {:ok,
        socket
        |> assign(%{
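The receiving side of this broadcast is not shown in this hunk. A minimal sketch of one possible shape, assuming the process subscribed to `worker_channel:<worker_id>` (for example a GenServer that schedules the rollback) keeps timer refs from `Process.send_after/3` in a hypothetical `rollback_timers` map in its state, keyed by run id:

```elixir
# Hypothetical handle_info/2 clause in the subscriber; not part of this diff.
def handle_info({:run_channel_joined, run_id, _worker_id}, state) do
  case Map.pop(state.rollback_timers, run_id) do
    {nil, _timers} ->
      # No pending timer: it already fired (run rolled back) or was never scheduled.
      {:noreply, state}

    {timer_ref, timers} ->
      # The worker joined in time, so cancel the pending rollback.
      Process.cancel_timer(timer_ref)
      {:noreply, %{state | rollback_timers: timers}}
  end
end
```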
@taylordowns2000 To avoid race conditions, it may be worth locking each record, checking that the record should be rolled back, and then rolling it back.
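A minimal sketch of that lock-check-rollback approach, assuming a `Lightning.Run` schema whose `state` field is an `Ecto.Enum`; the module name and `:noop` return are illustrative only, not code from this PR:

```elixir
defmodule Lightning.Runs.Rollback do
  @moduledoc false
  import Ecto.Query

  alias Lightning.{Repo, Run}

  # Lock the row, re-check that the run is still :claimed, and only then
  # flip it back to :available. If the worker joined (or the run moved on)
  # in the meantime, do nothing.
  def rollback_if_still_claimed(run_id) do
    Repo.transaction(fn ->
      run =
        from(r in Run, where: r.id == ^run_id, lock: "FOR UPDATE")
        |> Repo.one()

      case run do
        %Run{state: :claimed} = claimed_run ->
          claimed_run
          |> Ecto.Changeset.change(state: :available)
          |> Repo.update!()

        _ ->
          :noop
      end
    end)
  end
end
```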