
fix both permanent stopping of federation queues and multiple creation of the same federation queues #4754

Merged · 3 commits merged into main on May 30, 2024

Conversation

phiresky (Collaborator) opened this pull request:

#4733 made an erroneous change that removed the restart loop from CancellableTask. The change was made because both pieces of code that use CancellableTask normally run endlessly until they are cancelled.

But this restart loop is necessary, because the task runner can return due to an error without having been cancelled. Without the loop, a task that exits without being cancelled is neither logged nor restarted. Both lambdas return a Result<(), Error>, and the error can occur in a number of places. In those cases the task's internal state is not easily recoverable (it would need many special cases), but the design allows the task to be recreated from scratch.

In practice, one example of how this happens is when the DB pool encounters an error: the task dies and is restarted, with a "task exited, restarting" message being logged.
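
For reference, a minimal sketch of the restart behaviour this PR restores. This is illustrative only: `spawn_cancellable`, the log messages and the retry delay are assumptions, not the real CancellableTask code.

```rust
use std::time::Duration;

use tokio_util::sync::CancellationToken;

/// Sketch of a CancellableTask-style restart loop: rerun the fallible lambda
/// until the cancellation token fires, logging every unexpected exit.
fn spawn_cancellable<F, Fut>(token: CancellationToken, task: F) -> tokio::task::JoinHandle<()>
where
    F: Fn(CancellationToken) -> Fut + Send + 'static,
    Fut: std::future::Future<Output = Result<(), anyhow::Error>> + Send + 'static,
{
    tokio::spawn(async move {
        loop {
            let result = task(token.clone()).await;
            if token.is_cancelled() {
                // intentional shutdown: stop for good
                return;
            }
            // the lambda returned without being cancelled: log it and recreate
            // the task state from scratch by running the lambda again
            match result {
                Ok(()) => tracing::warn!("task exited, restarting"),
                Err(e) => tracing::warn!("task exited with error, restarting: {e}"),
            }
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    })
}
```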

In addition, there seems to be an issue where the federation SendManager sometimes starts multiple senders for one or more instances: #4609

I think this is caused by the issue described in a comment there: #4609 (comment)
CancellableTask assumes that the lambda it is given will clean up after itself; that is, if the lambda returns with an error, the design simply reruns the same lambda. For the InstanceWorkers this works fine: when they exit, they at most forget a bit of state, which causes a handful of resends.

But when the SendManager exits with an error, it previously did not clean up the InstanceWorkers it had spawned. Those then keep running forever, because the outer copy of their cancellation token has been dropped, while the SendManager creates new InstanceWorkers for the same domains.

I see two solutions:

  1. Change the return type of SendManager::do_loop from Result<()> to (), and change every instance of ? in the function to something infallible, for example by retrying DB queries in place. This has the advantage of not requiring all InstanceWorkers to be recreated on an intermittent failure, and the disadvantage of requiring a special case for every ?.
  2. When do_loop returns, kill all the InstanceWorkers regardless of whether do_loop returned an error or not. This has the advantage of being simple, since @Nutomic already split the cancel code into a method, and the disadvantage that when the "get instances" DB query fails, the code immediately kills all InstanceWorkers and restarts them, causing more load.

This PR implements (2), since it is the most direct fix. It also reverts the erroneous change of not restarting CancellableTasks on failure.
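
A minimal, self-contained sketch of what approach (2) amounts to. The struct layout and the method names here are stand-ins for the real SendManager, not the exact code in this PR:

```rust
use std::collections::HashMap;

use tokio_util::sync::CancellationToken;

/// Illustrative stand-in for SendManager: one cancellation token per spawned InstanceWorker.
struct SendManager {
    workers: HashMap<String, CancellationToken>,
}

impl SendManager {
    async fn do_loop(&mut self) -> Result<(), anyhow::Error> {
        // runs the "get all instances" query in a loop and spawns or updates one
        // InstanceWorker per domain; any `?` in here can make it return early with Err
        Ok(())
    }

    /// Cancel every worker this manager spawned (the cancel code that was already
    /// split into its own method).
    fn cancel_workers(&mut self) {
        for (_domain, token) in self.workers.drain() {
            token.cancel();
        }
    }

    /// Approach (2): tear the workers down whenever do_loop returns, error or not,
    /// so a restarted lambda never leaves orphaned InstanceWorkers behind.
    async fn run(mut self) -> Result<(), anyhow::Error> {
        let result = self.do_loop().await;
        self.cancel_workers();
        result
    }
}
```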

dullbananas (Collaborator) left a review comment:

This solution is not too bad, assuming that restarting all workers can only be triggered by database errors, not by networking or deserialization errors.

@Nutomic enabled auto-merge (squash) on May 29, 2024, 21:22
Review thread on this part of the diff:

```rust
LemmyResult::Ok(())
// if the task was not intentionally cancelled, then this whole lambda will be run again by
// CancellableTask after this
}
})
```
Nutomic (Member) commented:

Instead of all this, we could create the SendManager outside of CancellableTask and pass it in through a Mutex or similar. That way the workers are preserved even if the task gets restarted. But let's do that in a separate PR.
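
Roughly, that suggestion could look like the sketch below, reusing the illustrative `spawn_cancellable` and `SendManager` stand-ins from above; this is only an outline of the idea for a follow-up PR, not code in this one.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use tokio::sync::Mutex;
use tokio_util::sync::CancellationToken;

/// Sketch of the suggestion: keep the SendManager (and its workers hashmap)
/// outside the restartable task, so a restart after an error reuses the
/// existing workers instead of respawning them.
fn start_send_manager(cancel_token: CancellationToken) -> tokio::task::JoinHandle<()> {
    let manager = Arc::new(Mutex::new(SendManager { workers: HashMap::new() }));
    spawn_cancellable(cancel_token, move |_cancel| {
        let manager = manager.clone();
        async move {
            // only do_loop is restarted on error; the SendManager state survives
            manager.lock().await.do_loop().await
        }
    })
}
```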

phiresky (Collaborator, Author) commented:

I'm again not sure why you like mutexes so much, especially instead of something as simple as this. How can a mutex be simpler than literally 5 lines of code? :D

Nutomic (Member) commented:

Maybe not a mutex specifically; my point is that we move the workers hashmap outside the task so we don't need to restart all workers if it crashes.

phiresky (Collaborator, Author) commented:

Right, makes sense. I've checked the code again based on idea (1) above, and there's really only a single failure point in the loop: the "get all instances" query. So that query could be made infallible without too much effort.
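
For illustration, idea (1) applied to just that one failure point could look roughly like this; the `Instance::read_all`/`DbPool` names and the retry interval are assumptions standing in for whatever the real query is:

```rust
use std::time::Duration;

use lemmy_db_schema::{source::instance::Instance, utils::DbPool};

/// Sketch of making the "get all instances" query infallible: retry internally
/// instead of bubbling the error up with `?` and tearing down all workers.
async fn read_instances_infallible(pool: &mut DbPool<'_>) -> Vec<Instance> {
    loop {
        match Instance::read_all(pool).await {
            Ok(instances) => return instances,
            Err(e) => {
                tracing::warn!("failed to read instances, retrying: {e}");
                tokio::time::sleep(Duration::from_secs(30)).await;
            }
        }
    }
}
```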

@Nutomic merged commit e8a7bb0 into main on May 30, 2024 (2 checks passed).