Scaling federation #3062

Closed
Nutomic opened this issue Jun 13, 2023 · 14 comments
Labels
area: federation (support federation via activitypub), enhancement (New feature or request)

Comments

@Nutomic
Member

Nutomic commented Jun 13, 2023

Yesterday I posted an announcement telling admins of large instances that they need to increase the "federation worker count". These workers are needed to send outgoing federated actions. Since then I made the same adjustment on lemmy.ml, and had to increase the worker count up to 360,000. Luckily this isn't causing any problems yet, but it points to a scaling limitation which will likely become important in the future.

To understand this limitation it's important to know how federation in Lemmy communities works. Let's say a user from sopuli upvotes a comment in !memes@lemmy.ml. This upvote action is sent via ActivityPub to the lemmy.ml server, which forwards it to all instances where at least one user follows the memes community. The same happens for all other actions like creating or editing posts/comments, mod actions, and so on. The problem is that there are lots of these actions (particularly votes), and they need to be forwarded to lots of different servers. For example, the recent top posts in /c/memes have around 1500 upvotes. Let's assume that users from 100 different instances follow the community. Then federating the votes for this single post requires 1500 * 100 = 150,000 HTTP POST requests to other servers. On top of this are requests to federate comments and comment votes, which likely reach a similar magnitude.
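The arithmetic above can be written out as a quick back-of-the-envelope check. This is only a sketch: the function name `fanout_requests` is made up for illustration, and the inputs are the rough estimates from the paragraph above, not measurements.

```python
def fanout_requests(actions: int, following_instances: int) -> int:
    """Number of HTTP POSTs needed when every action is forwarded
    individually to every instance that follows the community."""
    return actions * following_instances


# Rough figures from the example above: about 1500 upvotes on a top post,
# and an assumed 100 instances following the community.
votes_per_post = 1500
instances = 100

print(fanout_requests(votes_per_post, instances))  # 150000
```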

Here are some possible workarounds and solutions:

  • Don't federate votes. This means users would only see votes from other users on their own instance. It's obviously far from ideal, but can be a quick emergency solution to prevent federation from breaking entirely.
  • Migrate large communities away from lemmy.ml. This requires a lot of effort as users have to subscribe to new communities manually. It is also unlikely to be a permanent solution, as other servers will also get overloaded sooner or later.
  • Instead of federating individual votes, send the aggregate number of votes during the last hour.

The last option seems to be preferable, but is not easy to implement. AFAIK there is no prior example of sending aggregate data over ActivityPub, so it would require an extension which would be incompatible with other platforms. It might also be necessary to rewrite the way post ranking is calculated. On the other hand this could be an improvement for privacy, as other instances don't see which particular user upvoted or downvoted a post.
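To make the aggregation idea concrete, here is a minimal sketch of collapsing one time window of votes into per-post counts. Everything in it is an assumption for illustration: the `aggregate_votes` name, the `(post_id, score)` vote shape, and the summary dictionary are hypothetical and do not correspond to any existing ActivityPub type or Lemmy code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical vote shape: (post_id, score), +1 for an upvote, -1 for a downvote.
Vote = Tuple[str, int]


def aggregate_votes(votes: Iterable[Vote]) -> Dict[str, Dict[str, int]]:
    """Collapse the individual votes of one time window into per-post counts."""
    totals = defaultdict(lambda: {"upvotes": 0, "downvotes": 0})
    for post_id, score in votes:
        if score > 0:
            totals[post_id]["upvotes"] += 1
        elif score < 0:
            totals[post_id]["downvotes"] += 1
    return dict(totals)


# Votes collected during one window (made-up data).
window = [("post/1", 1), ("post/1", 1), ("post/1", -1), ("post/2", 1)]
summary = aggregate_votes(window)

instances = 100
requests_individual = len(window) * instances  # one POST per vote per instance
requests_aggregated = instances                # one summary per instance per window
print(summary, requests_individual, requests_aggregated)
```

Note that the summary no longer carries user identities, which is where both the privacy benefit and the incompatibility with other platforms would come from.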

Nutomic added the enhancement and area: federation labels Jun 13, 2023
@TailyFair

Is sending aggregated votes secure? Is it possible that some bad-actor instances would send large fake counts?

@dadino

dadino commented Jun 13, 2023

If sending out the requests is the problem for lemmy.ml, you could send a single request (instead of 100) to a separate worker, on a separate server, that then forwards it to the 100 instances. You could create a load balancer that accepts multiple workers and decides where to send the single request, per request.

@simonsan

Would it be possible to reverse the process? That is, instead of sending POST requests when new votes arrive, aggregate them and make them available as a subscription on the server, so that they can be retrieved via GET requests from the respective instance?
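A pull-based variant like this could look roughly as follows. This is a hypothetical in-memory sketch, not how Lemmy federation works: the `VoteFeed` class, its method names, and the cursor scheme are all assumptions, and a real version would sit behind an HTTP GET endpoint.

```python
from typing import Dict, Tuple


class VoteFeed:
    """In-memory stand-in for an endpoint that remote instances poll."""

    def __init__(self) -> None:
        self._seq = 0                           # increases with every recorded vote
        self._totals: Dict[str, int] = {}       # post_id -> net score
        self._changed_at: Dict[str, int] = {}   # post_id -> seq of the last change

    def record_vote(self, post_id: str, score: int) -> None:
        self._seq += 1
        self._totals[post_id] = self._totals.get(post_id, 0) + score
        self._changed_at[post_id] = self._seq

    def changes_since(self, cursor: int) -> Tuple[int, Dict[str, int]]:
        """What a GET handler could return: the new cursor plus the current
        totals of every post that changed after `cursor`."""
        changed = {
            post_id: self._totals[post_id]
            for post_id, seq in self._changed_at.items()
            if seq > cursor
        }
        return self._seq, changed


feed = VoteFeed()
feed.record_vote("post/1", 1)
feed.record_vote("post/1", 1)
feed.record_vote("post/2", -1)

cursor, first_poll = feed.changes_since(0)        # first poll sees both posts
feed.record_vote("post/2", 1)
cursor, second_poll = feed.changes_since(cursor)  # second poll sees only post/2
```

The remote instance decides how often to poll, so the number of requests depends on the polling interval rather than on the number of votes.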

@tgy

tgy commented Jun 13, 2023

I am not particularly knowledgeable, but could Shared Inbox (part of the ActivityPub protocol) be of help?

Other people seem to have the same issue
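For reference, shared inbox delivery in ActivityPub means that followers on the same server can be reached through one `sharedInbox` URL (found under the actor's `endpoints`) instead of one POST per follower. A minimal sketch of picking delivery targets that way, with made-up actor data and a hypothetical `delivery_targets` helper:

```python
from typing import Dict, Iterable, Set


def delivery_targets(followers: Iterable[Dict]) -> Set[str]:
    """Pick one inbox URL per follower, preferring endpoints.sharedInbox,
    so that followers on the same server collapse into a single target."""
    targets: Set[str] = set()
    for actor in followers:
        shared = actor.get("endpoints", {}).get("sharedInbox")
        targets.add(shared or actor["inbox"])
    return targets


# Made-up follower actors: three on a.example, one on b.example
# (which advertises no shared inbox).
followers = [
    {"inbox": "https://a.example/u/alice/inbox",
     "endpoints": {"sharedInbox": "https://a.example/inbox"}},
    {"inbox": "https://a.example/u/bob/inbox",
     "endpoints": {"sharedInbox": "https://a.example/inbox"}},
    {"inbox": "https://a.example/u/carol/inbox",
     "endpoints": {"sharedInbox": "https://a.example/inbox"}},
    {"inbox": "https://b.example/u/dave/inbox"},
]

print(len(delivery_targets(followers)))  # 2
```

This only reduces deliveries to one per server. It does not reduce the number of activities sent to each server.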

@dessalines
Member

It sounds like a lot of requests, but maybe it isn't a big problem in terms of CPU or network. Rather than aggregating requests, perhaps we just need to make sure our federation job queue is efficient, and if it isn't, possibly use a different one.

@rlhennig

IMO, aggregation is the best route to take here. This is not actually a new problem, and aggregation is how it's done when disseminating updates between routers running BGP on the Internet. (I'm a network architect, so that's how I think of things.) OSPF, another routing protocol, also uses aggregation. When you have hundreds of thousands of updates, sending each one just isn't efficient. Things are going to tip over at some point if you do it that way. It may be more work to refactor things, but anything else I think is, at best, just buying you time.

@calculuschild

Are votes being updated immediately across all federated instances? If so, is live updating of vote counts even necessary? Aggregating votes every hour would help, but an hour lag seems like a lot.

Why not just retrieve votes from the hosting instance upon the thread being accessed by each user? Surely updating votes on page view will involve fewer requests than sending out a wave of updates on every vote.

@dadino

dadino commented Jun 14, 2023

> Are votes being updated immediately across all federated instances? If so, is live updating of vote counts even necessary? Aggregating votes every hour would help, but an hour lag seems like a lot.
>
> Why not just retrieve votes from the hosting instance upon the thread being accessed by each user? Surely updating votes on page view will involve fewer requests than sending out a wave of updates on every vote.

I guess votes are needed to sort posts.
Even 1 bundle per minute would drastically reduce the number of requests, while maintaining a quasi-live update of the content.
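A per-minute bundle could be sketched like this. The `OutboxBatcher` class and its methods are hypothetical names, the activities are placeholder dictionaries, and the clock is passed in explicitly so the example stays deterministic; nothing here reflects Lemmy's actual queue.

```python
from collections import defaultdict
from typing import Dict, List


class OutboxBatcher:
    """Collects outgoing activities per target instance and releases them
    as one bundle per instance once the interval has elapsed."""

    def __init__(self, interval_seconds: float = 60.0, start: float = 0.0) -> None:
        self.interval = interval_seconds
        self._last_flush = start
        self._pending: Dict[str, List[dict]] = defaultdict(list)

    def add(self, instance: str, activity: dict) -> None:
        self._pending[instance].append(activity)

    def flush_if_due(self, now: float) -> Dict[str, List[dict]]:
        """Return {instance: [activities]} if the interval has passed,
        otherwise an empty dict. Each returned entry would become one request."""
        if now - self._last_flush < self.interval:
            return {}
        bundles = dict(self._pending)
        self._pending = defaultdict(list)
        self._last_flush = now
        return bundles


batcher = OutboxBatcher(interval_seconds=60.0)
for i in range(1500):
    batcher.add("beehaw.org", {"type": "Like", "id": i})

too_early = batcher.flush_if_due(now=30.0)  # {} because only 30 s have passed
bundles = batcher.flush_if_due(now=60.0)    # one bundle holding all 1500 votes
print(len(bundles), len(bundles["beehaw.org"]))  # 1 1500
```

Unlike aggregation, the bundle still contains every individual vote, so the receiving side can process them exactly as before.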

@DomiStyle

DomiStyle commented Jun 14, 2023

> The last option seems to be preferable, but is not easy to implement. AFAIK there is no prior example of sending aggregate data over ActivityPub, so it would require an extension which would be incompatible with other platforms. It might also be necessary to rewrite the way post ranking is calculated. On the other hand this could be an improvement for privacy, as other instances don't see which particular user upvoted or downvoted a post.

To suggest a fourth option: I don't think aggregation is necessary as much as batching is. You could still send each vote individually in a single request but handle them when there is less server load. That way fake votes are less of an issue.

I'm not entirely sure how ActivityPub works but I assume it is legal to respond with a 429 or a 503 and a Retry-After header?

That way the source server could send updates immediately, if the target is overloaded it sends a Retry-After header and the source server batches all updates for that target server together until the time expires.
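On the sending side, honoring `Retry-After` mostly comes down to turning the header into a delay. The header value is either a number of seconds or an HTTP-date. Below is a small sketch using only the Python standard library; the function names `retry_after_seconds` and `pause_for` are hypothetical, and whether receiving ActivityPub servers actually answer this way is an open question in this thread.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping


def retry_after_seconds(header_value: str, now: datetime) -> float:
    """Convert a Retry-After value (delay in seconds or HTTP-date) to seconds."""
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    retry_at = parsedate_to_datetime(value)
    return max(0.0, (retry_at - now).total_seconds())


def pause_for(status: int, headers: Mapping[str, str], now: datetime) -> float:
    """Seconds to hold back (and batch) deliveries to this target; 0.0 means keep sending."""
    if status in (429, 503) and "Retry-After" in headers:
        return retry_after_seconds(headers["Retry-After"], now)
    return 0.0


now = datetime(2023, 6, 14, 12, 0, 0, tzinfo=timezone.utc)
print(pause_for(429, {"Retry-After": "120"}, now))                            # 120.0
print(pause_for(503, {"Retry-After": "Wed, 14 Jun 2023 12:00:30 GMT"}, now))  # 30.0
print(pause_for(200, {}, now))                                                # 0.0
```

While the pause is active, the source server would queue activities for that target and send them together once the time expires.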

Could also add prioritization for important events like posts and comments and push votes to a later date when load should be lower. I think having the votes arrive reliably is more important than having them update live. Reddit also does not update votes live.
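Prioritization of this kind can be sketched with a heap-based queue. The priority table below is an assumption chosen for illustration (content first, votes later), and `PrioritySendQueue` is a hypothetical name, not part of Lemmy.

```python
import heapq
import itertools
from typing import List, Tuple

# Lower number = sent earlier. The mapping is an assumption for illustration.
PRIORITY = {"Create": 0, "Update": 0, "Like": 1, "Dislike": 1}


class PrioritySendQueue:
    """Outgoing queue that releases posts and comments before votes."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, dict]] = []
        # The counter breaks ties so that equal priorities stay in FIFO order.
        self._counter = itertools.count()

    def push(self, activity: dict) -> None:
        priority = PRIORITY.get(activity["type"], 1)
        heapq.heappush(self._heap, (priority, next(self._counter), activity))

    def pop(self) -> dict:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PrioritySendQueue()
queue.push({"type": "Like", "object": "post/1"})
queue.push({"type": "Create", "object": "post/2"})
queue.push({"type": "Like", "object": "post/2"})
queue.push({"type": "Create", "object": "comment/7"})

order = [queue.pop()["object"] for _ in range(len(queue))]
print(order)  # ['post/2', 'comment/7', 'post/1', 'post/2']
```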

I just briefly glanced over the ActivityPub spec but there seems to be a collection type for likes, is it possible to use this at least for the votes? https://www.w3.org/TR/activitypub/#likes

@Nutomic
Member Author

Nutomic commented Jun 14, 2023

It seems like this is not really a problem like I thought because these send jobs are very lightweight. Probably the solution is to remove the worker count setting so that unlimited workers can be created on demand.
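As a rough analogy for "workers created on demand", the sketch below spawns one lightweight task per delivery instead of drawing from a fixed-size pool. It uses Python's `asyncio` purely for illustration; Lemmy itself is written in Rust, and `send_activity` here is a placeholder that only sleeps rather than performing a real HTTP POST.

```python
import asyncio
from typing import List


async def send_activity(inbox: str, activity: dict) -> str:
    """Placeholder for an HTTP POST; sleeps briefly to mimic network wait."""
    await asyncio.sleep(0.01)
    return inbox


async def deliver(activity: dict, inboxes: List[str]) -> List[str]:
    # One task per delivery, created on demand, with no configured worker count.
    tasks = [asyncio.create_task(send_activity(inbox, activity)) for inbox in inboxes]
    return await asyncio.gather(*tasks)


# Made-up inbox URLs for 100 following instances.
inboxes = [f"https://instance{i}.example/inbox" for i in range(100)]
delivered = asyncio.run(deliver({"type": "Like"}, inboxes))
print(len(delivered))  # 100
```

Because each task spends nearly all of its time waiting on the network, many of them can be in flight at once. A real implementation would likely still want some bound on open connections, which is outside this sketch.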

@DomiStyle

@Nutomic Is there a different issue to follow that's currently hindering federation then? Federating instances are missing comments and votes.

For example, here's a random post on !technology@lemmy.ml that's 4 hours old: https://lemmy.ml/post/1250165

Here's how it looks on different instances:

Instance          Comments   Votes
lemmy.ml                13      95
beehaw.org               6      22
lemmy.world             11      64
sh.itjust.works         10      45

For comparison, here's a random post from !technology@beehaw.org that's also 4 hours old: https://beehaw.org/post/548636

Instance          Comments   Votes
beehaw.org              70      59
lemmy.ml                63      20
lemmy.world             66     160
sh.itjust.works         62     127

While the votes being different is not such a huge deal, missing comments is absolutely a huge issue.

@Kryptortio

Does Lemmy utilize some kind of swarm sharing? I'm thinking that scalability could increase a lot if when you ask one instance for an update and it happens to have updates from other instances that you don't have then it could send those as well. If they are aggregated you could check the timestamps to determine if you have collected the entire timeline.

You would need to send more data in the initial request to let the instance know what data you need but the total amount of requests could be greatly reduced and the load of sharing all the data could be distributed across all instances.
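The exchange described here could be sketched as follows: the requester reports what it has already seen per origin, and the peer returns anything newer that it happens to hold. This is purely hypothetical. Per-origin sequence numbers stand in for the timestamps mentioned above, and `missing_for` is a made-up helper; Lemmy federation does not currently work this way.

```python
from typing import Dict, List, Tuple

# Hypothetical record: (origin_instance, sequence_number, payload).
Activity = Tuple[str, int, str]


def missing_for(requester_seen: Dict[str, int], peer_log: List[Activity]) -> List[Activity]:
    """Activities the peer holds that are newer than what the requester
    reports having seen for each origin instance."""
    return [
        (origin, seq, payload)
        for origin, seq, payload in peer_log
        if seq > requester_seen.get(origin, 0)
    ]


# What the peer has stored, including activities that originated elsewhere.
peer_log = [
    ("lemmy.ml", 1, "comment A"),
    ("lemmy.ml", 2, "comment B"),
    ("beehaw.org", 1, "comment C"),
]

# The requester has only seen the first lemmy.ml activity.
requester_seen = {"lemmy.ml": 1}
print(missing_for(requester_seen, peer_log))
# [('lemmy.ml', 2, 'comment B'), ('beehaw.org', 1, 'comment C')]
```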

@Nutomic
Member Author

Nutomic commented Jun 14, 2023

@DomiStyle I'm not sure what the reason is; it's certainly worth investigating. A possibility would be instance blocks or user bans. It could also be networking problems, or a software bug. There is also this issue which means that activities will get lost during restart.

@Kryptortio Federation uses POST requests, except for explicit user requests to fetch a remote object (e.g. searching a community URL). So there is no automatic "asking other servers". It might be possible to implement something like that, but it would require major changes to federation logic. I suggest you read the Lemmy federation docs and the ActivityPub standard to get a better understanding of how it all works.

@Nutomic
Member Author

Nutomic commented Jun 15, 2023

Closing in favor of #3121

@Nutomic Nutomic closed this as completed Jun 15, 2023

10 participants