[Bug]: Extremely high load / CPU usage from lemmy_server processes since upgrading to 0.19/0.19.1 #4334
Comments
Database seems completely fine and does not show high usage at all:
I have been seeing roughly the same thing; I know there are a couple of posts around CPU usage. My instance is just myself and a handful of subs. I only have one lemmy_server, but it's consuming 367% CPU. I tuned my pool_size as well, but the load doesn't seem to be coming from postgres either.
I'm having the same issue. Seeing 50-97% CPU usage.
@eric7456: I think you may need more than one CPU now with Lemmy 0.19. I have 3 CPUs and they are all pretty busy most of the time. Before 0.19 the Lemmy software used very little CPU, and most of it was on the database side. But now the lemmy process is quite CPU hungry. It would of course be great if it were possible to reduce the CPU usage in 0.19, but so much was rewritten that I'm not sure it's possible. Guess we will see.
@linux-cultist It will peg as many cores as you throw at it to 100%, no matter how little activity is taking place on the server. Don't think that's normal.
It would be helpful if you could split Lemmy into multiple processes for different tasks and see which one is causing the high CPU usage. Use the following in docker-compose.yml; what's important are the entrypoints:

```yaml
  lemmy:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-activity-sending --disable-scheduled-tasks

  lemmy-federate:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-http-server --disable-scheduled-tasks

  lemmy-tasks:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-activity-sending --disable-http-server
```
@Nutomic: I don't think I have the bug myself, only around 10% usage sometimes on one of the lemmy processes. The rest of the CPU usage is postgres. Maybe @philmichel gets a better picture.
Indeed, I'll turn off the server for now until there's an update.
@phiresky Do you have any idea how to reduce the CPU usage for activity sending?
It would be good to get some info on the state of the federation queue while this is happening. I don't think anyone has linked an instance here yet? Check https://phiresky.github.io/lemmy-federation-state/ . It's expected right now that when sending has been down for a while, it will take up a lot of resources while the queue is catching up (since it will send out all pending events since the queue went down). That's what happened on lemm.ee. If it's consistently that high even when the queue is caught up, it might be some bug.
Also, have any of you cleared your sent_activities tables after upgrading? That might also cause this to happen.
@phiresky What do you think about adding some (maybe configurable) throttling to workers, so that for example each worker would pause for X milliseconds between handling activities? Just to prevent self-DoSing when catching up with a backlog.
@sunaurus It seems like an idea, but what would you use as a heuristic? You do want it to catch up as quickly as possible, and you don't want it to start lagging when you have many activities per second in general and your server can handle it. Maybe it would be possible to only do it when "catching up" is detected (last_successful_activity_send < 1 hour ago) or something, but I don't know.
I think having it as a tunable parameter with some conservative default would ensure that smaller servers don't get killed by this, and bigger servers have a chance to tune it for their needs. I am assuming that if the trade-off is "catching up as quickly as possible and making the instance super slow" vs "catching up a bit slower, but keeping the instance usable", most admins will prefer the latter.

On lemm.ee, working through the backlog at full speed had the Lemmy process completely using up all the vCPUs I threw at it, until I ran into DB bottlenecks. Under normal conditions (after it caught up), it's now averaging around 5% of 1 vCPU.

I would probably aim for something like 50% CPU max, so assuming CPU use scales linearly with the number of activities processed per second, I would just multiply my average activities per second by 10 and use that as a limit for how many activities each worker should process in a second, under all circumstances (so even when already caught up). That would still ensure real-time federation under average conditions.

I THINK a maximum of "1 activity per second" (per worker) might be a nice default for most instances, and should work even on the smallest 1 vCPU servers, but this would need some testing.
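Just to make the mechanics concrete, a per-worker loop that caps sending at one activity per tick (1 second here, which would be the configurable knob) could look roughly like the sketch below. This is not Lemmy's code: `send_loop`, `next_activity`, and the commented-out `send_activity` are made-up names, and the only assumption is a tokio runtime.

```rust
use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

// Hypothetical per-worker send loop: at most one activity per tick.
async fn send_loop(mut next_activity: impl FnMut() -> Option<i64>) {
    let mut tick = interval(Duration::from_secs(1));
    // If we fall behind, don't burst through missed ticks all at once.
    tick.set_missed_tick_behavior(MissedTickBehavior::Delay);
    while let Some(activity_id) = next_activity() {
        tick.tick().await;
        // send_activity(activity_id).await; // placeholder for the real send
        let _ = activity_id;
    }
}
```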
That seems far too low tbh. When I implemented the queue my reference was 50-100 activities/second (with horizontal scaling though). Can you say how many activities you have in your db in the past 24h?

I mean, sure, it will "work" for tiny servers, but it will also mean any instance with anywhere close to 1 activity per second will never catch up and will even start lagging behind by days automatically.
What you say seems reasonable, though maybe as a dynamic limit: set the speed to 10x what the actual activity rate in the DB is. The question is just where to get that activity-rate stat from in the db, because if the instance was just down it's going to be wrong.
I just checked, I have ~40k rows in

But making it dynamic seems like a good idea. I see your point about the instance being down making the average difficult to calculate, but maybe we can instead approximate the average by tracking changes in the size of the backlog.

We could use 1 activity per second as a baseline limit (after each activity, if a full second has not passed from the start of the iteration, the worker will sleep until the second has passed before starting the next iteration), and then dynamically change this limit in response to the size of the backlog from minute to minute. As long as the backlog for the worker is shrinking "enough" each minute, we can keep the throttling level constant, and if the backlog is not shrinking enough, we can incrementally reduce the throttling until it starts shrinking again. If we respond to the shrinking/growing gradually (maybe a maximum change of ±50 ms per minute to the delay between activities), then we can quite smoothly handle spikes in activity as well.

I think this might actually be relatively simple to implement; it seems like the only complicated part is deciding on what "shrinking enough" means. Maybe something like 0.1% of the backlog per minute?
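A rough sketch of that feedback loop under the assumptions above (1 s baseline delay, at most 50 ms adjustment per minute, 0.1% shrinkage threshold). The `BacklogThrottle` type and its fields are invented for illustration and are not Lemmy APIs; a worker would sleep `delay()` between activities and call `adjust()` once per minute with its current backlog size.

```rust
use std::time::Duration;

struct BacklogThrottle {
    delay: Duration,
    last_backlog: u64,
}

impl BacklogThrottle {
    fn new(initial_backlog: u64) -> Self {
        Self {
            // 1 s between activities as the conservative baseline.
            delay: Duration::from_secs(1),
            last_backlog: initial_backlog,
        }
    }

    /// Called once per minute with the current backlog size for this worker.
    fn adjust(&mut self, backlog: u64) {
        // "Shrinking enough" = the backlog dropped by at least ~0.1% this minute.
        let target_drop = (self.last_backlog / 1000).max(1);
        let shrank_enough = backlog + target_drop <= self.last_backlog;
        if !shrank_enough {
            // Not shrinking enough: lower the delay by at most 50 ms per minute,
            // letting this worker send faster until the backlog drops again.
            self.delay = self.delay.saturating_sub(Duration::from_millis(50));
        }
        // Otherwise keep the current throttling level.
        self.last_backlog = backlog;
    }

    fn delay(&self) -> Duration {
        self.delay
    }
}
```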
I think a better solution would be to assign priorities to tasks, so that incoming HTTP requests are always handled before sending out activities. Unfortunately it seems that tokio doesn't have any such functionality. Alternatively we could use the RuntimeMetrics from tokio, and check the number of available workers. Something like,
Interesting idea. What might also work is to use the db pool stats and start throttling if the pool has no available connections: https://docs.rs/deadpool/latest/deadpool/struct.Status.html (since sunaurus's primary reason for this was that the database was overloaded?). Either will have a weird interaction with horizontal scaling though, since then the pools (both tokio and pg) are separate.
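As a sketch of that pool-stats idea: `Pool::status()` and the `available` field come from the deadpool docs linked above, while the function name, threshold, and sleep duration are invented here, and Lemmy's actual pool type (wrapped through diesel-async) would differ.

```rust
use std::time::Duration;
use deadpool_postgres::Pool;

// Back off briefly when the DB pool has no idle connections left,
// i.e. the database side is the bottleneck rather than this sender.
async fn backoff_if_db_saturated(pool: &Pool) {
    let status = pool.status();
    if status.available == 0 {
        tokio::time::sleep(Duration::from_millis(250)).await;
    }
}
```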
Is prioritizing straightforward to add? It's probably a good idea for those running everything in a single process, but I still think throttling is necessary in general as well.
Can you maybe add some detail about what exactly the issue was for you? Was it that the rest of the site wasn't able to do any DB queries? (I don't know how that could happen, since you have separate pools for the federation process and the other one.) Or that your CPU was at 100% and the normal process didn't get any CPU time? Or that your network was fully used?
lemmy-federate_1 | Lemmy v0.19.1
Only issue is the CPU stuck at 100%. The rest of the site works fine, maybe a bit slower than typical. No extra network usage that I could tell.
This was the major problem on lemm.ee for sure - the CPU use of the worker was not a big deal, as I just moved the worker(s) onto separate hardware.

Having said that, I can definitely see it being a problem for other instances that run on a single server, ESPECIALLY if the DB is also on that same server, as it is for many (most?) instances. On lemm.ee, our database was just not ready for the extra load that came from full-speed catch-up federation (as it resulted in tens of thousands of extra DB queries per second, which were quite cheap individually, but combined did have a significant effect).

I've now upgraded the lemm.ee database with a lot of extra headroom, so it's probably not a big issue going forward for lemm.ee, but I can definitely see some other instances running into the same issue.
That's interesting, because it also means that just reducing the pool size of the fed queue should fix it, since that way you can limit it to e.g. 10 concurrent queries.
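To illustrate what a smaller pool effectively does, here is a generic concurrency bound using a `tokio::sync::Semaphore`. With a real connection pool the cap comes from the pool itself; this is not Lemmy's mechanism, and all names are made up for the example.

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::sync::Semaphore;

// Run jobs with at most `max_in_flight` of them executing concurrently.
async fn run_bounded<F, Fut>(jobs: Vec<F>, max_in_flight: usize)
where
    F: FnOnce() -> Fut + Send + 'static,
    Fut: Future<Output = ()> + Send + 'static,
{
    let semaphore = Arc::new(Semaphore::new(max_in_flight));
    let mut handles = Vec::new();
    for job in jobs {
        // Waits here whenever `max_in_flight` jobs are already running.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when the job finishes
            job().await;
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}
```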
Here are postgres stats from lemmy.pt: mean_exec_time.json

One problem is with get_community_follower_inboxes(). Also, updating community aggregates is extremely slow for some reason. Lots of time is spent on inserting received activities and votes, which is probably unavoidable. The total_exec_time.json by @eric7456 also shows different community aggregate updates using up an extreme amount of time.
Do you mean get_instance_followed_community_inboxes? I don't think
I experienced this on lemm.ee as well. The same is also true for updating site aggregates. This was the same issue: #4306 (comment)

For now I solved it on lemm.ee by just throwing more hardware at the problem, but one thing I mentioned in the above thread:
I actually did a quick hacky experiment with such denormalization on lemm.ee on my old hardware, and saw significant improvement. I added a

I assume we could get a similar speed-up for community aggregates as well by creating some new community_last_activity table, which just stores

Basically the trade-off here would be that we need to start updating the last-activity timestamps whenever somebody adds a vote/comment/post, but this seems relatively cheap compared to the huge joins that we are doing right now, which postgres seems to struggle with.
@philmichel You're right, it's

@sunaurus Would be good if you can make a PR for that.
Requirements
Summary
Since upgrading to Lemmy 0.19 (and subsequently 0.19.1; the issue is the same across both versions) via docker-compose, my small single-user instance shows very high load. Many lemmy_server processes are spawned, each consuming one full core, slowing the system to a crawl.

Steps to Reproduce
htop output:
Technical Details
Lemmy v0.19.1
Version
BE 0.19.1
Lemmy Instance URL
No response