
fix both permanent stopping of federation queues and multiple creation of the same federation queues #4754

Merged · 3 commits merged into main on May 30, 2024

Conversation

phiresky (Collaborator) opened this pull request:

#4733 made an erroneous change that removed the restart loop from CancellableTask. The change was made because both pieces of code that use CancellableTask normally run endlessly until they are cancelled.

But this restart loop is necessary, because the task runner can return due to an error without having been cancelled. Without the loop, a task that exits without being cancelled is neither logged nor restarted. Both lambdas return a Result<(), Error>, and the error can occur in a number of places. In those cases the task's internal state is not easily recoverable (it would need many special cases), but the design allows the task to be recreated from scratch.

In practice, one example of how this happens is when the DB pool encounters an error: the task dies and is restarted, with a "task exited, restarting" message being logged.
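
For reference, a minimal sketch of the restart behaviour this PR restores. This is illustrative only: `spawn_cancellable`, the log messages and the retry delay are assumptions, not the real CancellableTask code.

```rust
use std::time::Duration;

use tokio_util::sync::CancellationToken;

/// Sketch of a CancellableTask-style restart loop: rerun the fallible lambda
/// until the cancellation token fires, logging every unexpected exit.
fn spawn_cancellable<F, Fut>(token: CancellationToken, task: F) -> tokio::task::JoinHandle<()>
where
    F: Fn(CancellationToken) -> Fut + Send + 'static,
    Fut: std::future::Future<Output = Result<(), anyhow::Error>> + Send + 'static,
{
    tokio::spawn(async move {
        loop {
            let result = task(token.clone()).await;
            if token.is_cancelled() {
                // intentional shutdown: stop for good
                return;
            }
            // the lambda returned without being cancelled: log it and recreate
            // the task state from scratch by running the lambda again
            match result {
                Ok(()) => tracing::warn!("task exited, restarting"),
                Err(e) => tracing::warn!("task exited with error, restarting: {e}"),
            }
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    })
}
```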

In addition, there seems to be an issue where the federation SendManager sometimes starts multiple senders for one or more instances: #4609

I think this is caused by the issue described in a comment there: #4609 (comment)
CancellableTask assumes that the lambda it is given will clean up after itself; that is, if the lambda returns with an error, the design simply reruns the same lambda. For the InstanceWorkers this works fine: when they exit, they at most forget a bit of state, which causes a handful of resends.

But when the SendManager exits with an error, it previously did not clean up the InstanceWorkers it had spawned. Those then keep running forever, because the outer copy of their cancellation token has been dropped, while the SendManager creates new InstanceWorkers for the same domains.

I see two solutions:

  1. Change the return type of SendManager::do_loop from Result<()> to (), and change every instance of ? in the function to something infallible, for example by retrying DB queries in place. This has the advantage of not requiring all InstanceWorkers to be recreated on an intermittent failure, and the disadvantage of requiring a special case for every ?.
  2. When do_loop returns, kill all the InstanceWorkers regardless of whether do_loop returned an error or not. This has the advantage of being simple, since @Nutomic already split the cancel code into a method, and the disadvantage that when the "get instances" DB query fails, the code immediately kills all InstanceWorkers and restarts them, causing more load.

This PR implements (2), since it is the most direct fix. It also reverts the erroneous change of not restarting CancellableTasks on failure.
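
A minimal, self-contained sketch of what approach (2) amounts to. The struct layout and the method names here are stand-ins for the real SendManager, not the exact code in this PR:

```rust
use std::collections::HashMap;

use tokio_util::sync::CancellationToken;

/// Illustrative stand-in for SendManager: one cancellation token per spawned InstanceWorker.
struct SendManager {
    workers: HashMap<String, CancellationToken>,
}

impl SendManager {
    async fn do_loop(&mut self) -> Result<(), anyhow::Error> {
        // runs the "get all instances" query in a loop and spawns or updates one
        // InstanceWorker per domain; any `?` in here can make it return early with Err
        Ok(())
    }

    /// Cancel every worker this manager spawned (the cancel code that was already
    /// split into its own method).
    fn cancel_workers(&mut self) {
        for (_domain, token) in self.workers.drain() {
            token.cancel();
        }
    }

    /// Approach (2): tear the workers down whenever do_loop returns, error or not,
    /// so a restarted lambda never leaves orphaned InstanceWorkers behind.
    async fn run(mut self) -> Result<(), anyhow::Error> {
        let result = self.do_loop().await;
        self.cancel_workers();
        result
    }
}
```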

dullbananas (Collaborator) left a review comment:

This solution is not too bad, assuming that restarting all workers can only be triggered by database errors, not by networking or deserialization errors.

@Nutomic enabled auto-merge (squash) on May 29, 2024, 21:22
Review thread on this part of the diff:

```rust
LemmyResult::Ok(())
// if the task was not intentionally cancelled, then this whole lambda will be run again by
// CancellableTask after this
}
})
```
Nutomic (Member) commented:

Instead of all this, we could create the SendManager outside of CancellableTask and pass it in through a Mutex or similar. That way the workers are preserved even if the task gets restarted. But let's do that in a separate PR.
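
Roughly, that suggestion could look like the sketch below, reusing the illustrative `spawn_cancellable` and `SendManager` stand-ins from above; this is only an outline of the idea for a follow-up PR, not code in this one.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use tokio::sync::Mutex;
use tokio_util::sync::CancellationToken;

/// Sketch of the suggestion: keep the SendManager (and its workers hashmap)
/// outside the restartable task, so a restart after an error reuses the
/// existing workers instead of respawning them.
fn start_send_manager(cancel_token: CancellationToken) -> tokio::task::JoinHandle<()> {
    let manager = Arc::new(Mutex::new(SendManager { workers: HashMap::new() }));
    spawn_cancellable(cancel_token, move |_cancel| {
        let manager = manager.clone();
        async move {
            // only do_loop is restarted on error; the SendManager state survives
            manager.lock().await.do_loop().await
        }
    })
}
```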

phiresky (Collaborator, Author) commented:

I'm again not sure why you like mutexes so much, especially instead of something as simple as this. How can a mutex be simpler than literally 5 lines of code? :D

Nutomic (Member) commented:

Maybe not a mutex specifically; my point is that we move the workers hashmap outside the task so we don't need to restart all workers if it crashes.

phiresky (Collaborator, Author) commented:

Right, makes sense. I've checked the code again based on idea (1) above, and there's really only a single failure point in the loop: the "get all instances" query. So that query could be made infallible without too much effort.
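
For illustration, idea (1) applied to just that one failure point could look roughly like this; the `Instance::read_all`/`DbPool` names and the retry interval are assumptions standing in for whatever the real query is:

```rust
use std::time::Duration;

use lemmy_db_schema::{source::instance::Instance, utils::DbPool};

/// Sketch of making the "get all instances" query infallible: retry internally
/// instead of bubbling the error up with `?` and tearing down all workers.
async fn read_instances_infallible(pool: &mut DbPool<'_>) -> Vec<Instance> {
    loop {
        match Instance::read_all(pool).await {
            Ok(instances) => return instances,
            Err(e) => {
                tracing::warn!("failed to read instances, retrying: {e}");
                tokio::time::sleep(Duration::from_secs(30)).await;
            }
        }
    }
}
```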

@Nutomic merged commit e8a7bb0 into main on May 30, 2024 (2 checks passed).