fix both permanent stopping of federation queues and multiple creation of the same federation queues #4754
Conversation
This solution is not too bad, assuming that restarting all workers can't be triggered by networking or deserialization errors for anything other than the database
```rust
      LemmyResult::Ok(())
      // if the task was not intentionally cancelled, then this whole lambda will be run again by
      // CancellableTask after this
    }
  })
```
Instead of all this we could create the SendManager outside of CancellableTask and pass it in through a Mutex or similar. That way workers are preserved even if the task gets restarted. But let's do that in a separate PR.
I'm again not sure why you like Mutexes so much, especially instead of something as simple as this. How can a mutex be simpler than literally 5 lines of code? :D
Maybe not a mutex specifically; my point is that we move the workers hashmap outside the task so we don't need to restart all workers if it crashes.
Right, makes sense. I've checked the code again based on idea (1) above, and there's really only a single failure point in the loop and that is the "get all instances" query. So that query could be made infallible without too much effort.
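For reference, making that single failure point infallible could look roughly like this retry wrapper; `read_all_instances` is a hypothetical stand-in for the actual query:

```rust
// Sketch of retrying the "get all instances" query instead of propagating the
// error out of the send loop. The query function itself is a placeholder.
use std::time::Duration;

async fn read_all_instances() -> Result<Vec<String>, Box<dyn std::error::Error>> {
    Ok(vec![]) // placeholder for the real database query
}

async fn read_all_instances_infallible() -> Vec<String> {
    loop {
        match read_all_instances().await {
            Ok(instances) => return instances,
            Err(e) => {
                // Log and retry after a short delay rather than failing the loop.
                tracing::warn!("failed to read instances, retrying: {}", e);
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    }
}
```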
#4733 made an erroneous change that removed the restart loop from CancellableTask. The change was made because both tasks that use CancellableTask normally run endlessly until cancelled.
But this restart loop is necessary, since the task runner can return due to an error without having been cancelled. Without the loop, that state is neither logged nor fixed. Both lambdas return a Result<(), Error>, and the error can occur in a number of places. In those cases the task state itself is not easily recoverable (it would need many special cases), but the design allows it to be recreated from scratch.
In practice, one example of how this happens is when the DB pool encounters an error: the task dies and is restarted, with "task exited, restarting" being logged.
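For illustration, the restart behaviour being reinstated looks roughly like the following sketch (assumed names and signatures, not the exact CancellableTask implementation):

```rust
// Sketch: rerun the task closure until its cancellation token fires, logging
// each unexpected exit. The real CancellableTask differs in detail.
use std::future::Future;
use tokio_util::sync::CancellationToken;

fn spawn_restarting<F, Fut>(task: F) -> CancellationToken
where
    F: Fn(CancellationToken) -> Fut + Send + 'static,
    Fut: Future<Output = Result<(), Box<dyn std::error::Error + Send + Sync>>> + Send,
{
    let token = CancellationToken::new();
    let child = token.clone();
    tokio::spawn(async move {
        loop {
            let result = task(child.clone()).await;
            if child.is_cancelled() {
                break; // intentional shutdown, do not restart
            }
            // Unexpected exit (e.g. a DB pool error): log and run the task again.
            tracing::warn!("task exited, restarting: {:?}", result);
        }
    });
    token
}
```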
In addition, there seems to be an issue where sometimes the federation SendManager starts multiple senders for one or more instances: #4609
I think this is caused by the issue described in a comment there: #4609 (comment):
CancellableTask assumes that the lambda it is given will clean up after itself - that is, if it returns with an error, it is designed to rerun the same lambda. For the InstanceWorkers this works fine: when they exit they at most forget a bit of state, which causes a handful of resends. But when the SendManager exits with an error, it previously did not clean up the InstanceWorkers it spawned itself. Those then keep running forever, the outer copy of the cancellation token having been dropped, while SendManager creates new InstanceWorkers for the same domains.
I see two solutions:

1. Change the return type of SendManager::do_loop from Result<()> to (). Then change every instance of ? in the function to something infallible, for example by retrying db queries directly. This has the advantage of not requiring recreation of all InstanceWorkers on intermittent failure, and the disadvantage of requiring special cases for every ?.
2. When do_loop returns, kill all the instance workers regardless of whether do_loop returned an error or not. This has the advantage of being simple, since @Nutomic already split the cancel code into a method, and the disadvantage that when the "get instances" DB query fails, the code immediately kills all instance workers and restarts them, causing more load.

This PR implements (2) since it is the most direct fix, as sketched below. It also reverts the erroneous change of not restarting CancellableTasks on failure.