[Bug]: Deleting a user with lots of comments (1,59k) modding a few large communities (~20) kills the backend #3165
Comments
This comment was marked as abuse.
The query I mentioned in #3649 seems to have just been replicated out and taken out a significant number of instances, including lemmy.world.
Ok, I'm frustratingly close to figuring this out but I'm just completely stumped. First off, two triggers seem to be to blame: site_aggregates_comment_delete and person_aggregates_comment_count. I'm focusing on the first one for now and trying to figure it out, so I'm just dropping the other one for the moment. Clone a copy of your database to play around with, and do this:
You're now in a state where whatever triggers that query will dump some debug info to your psql session, but you're also going to get a copy of every query that runs in your postgres docker log. Now run something like this in a second terminal:
Then run this in your first window:
You'll get this in your debug output:
Looks reasonable right? I'm lazy and wanted to summarize how many times each function ran:
Cool beans. Looks good. Now try running it against two rows:
Cool, still looks good. Hop over to the postgres log to see what the queries look like:
What. The. Fuck. Somehow once you update more than 1 row in comment, postgres goes nuts running the update query inside the trigger. What I don't understand is that the RAISE lines in the trigger aren't firing multiple times, so what's causing only the update site_aggregates query to run 1677 times? There are 1675 rows in site_aggregates + the 2 rows I updated, and I think that's how postgres gets to running it 1677 times. What's eluding me is what is causing those updates to happen. I can't find any strange triggers or anything else that would run this. Anyone got any clever ideas? I've pored over the postgres trigger docs and nothing is standing out.
The second bug was an easier fix:
I've added the If you delete the last trigger I mentioned and then do this update, the
This comment was marked as abuse.
Yeah I played around with the actual function too and found something interesting. If you hard code a single "where id = 1234" inside the trigger, this issue goes away. You can simplify that query down to Also I think that query might need a conditional on which site id? It seems odd that deleting a comment decreases the aggregated comment count for every instance. Maybe it's meant to be a global counter of some sort, I didn't dig into that yet.
LOL same, I figured I'd just figure out the sql first. Nice to know I'm not going crazy and this isn't something obvious.
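To make the missing WHERE clause concrete, here is a minimal sketch of the effect. It is not Lemmy's actual trigger or schema: it uses Python's built-in SQLite as a stand-in for PostgreSQL, the tables are cut down to a couple of columns, and the `writes` column and the `row_writes` helper are made up purely to count how many aggregate rows get touched. It only shows that a per-row trigger with an unscoped UPDATE writes every aggregate row once per soft-deleted comment.

```python
import sqlite3


def row_writes(scoped: bool, site_rows: int = 50, deleted_comments: int = 20) -> int:
    """Return how many site_aggregates row writes one bulk soft-delete causes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE comment (
            id INTEGER PRIMARY KEY,
            deleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE site_aggregates (
            id INTEGER PRIMARY KEY,
            comments INTEGER NOT NULL,
            writes INTEGER NOT NULL DEFAULT 0  -- hypothetical counter, for illustration only
        );
        """
    )
    # The only difference between the two variants is this WHERE clause.
    where = "WHERE id = 1" if scoped else ""
    conn.execute(
        f"""
        CREATE TRIGGER comment_soft_delete
        AFTER UPDATE OF deleted ON comment
        FOR EACH ROW WHEN OLD.deleted = 0 AND NEW.deleted = 1
        BEGIN
            UPDATE site_aggregates
            SET comments = comments - 1, writes = writes + 1 {where};
        END
        """
    )
    conn.executemany(
        "INSERT INTO site_aggregates (id, comments) VALUES (?, ?)",
        [(i, 1000) for i in range(1, site_rows + 1)],
    )
    conn.executemany(
        "INSERT INTO comment (id) VALUES (?)",
        [(i,) for i in range(1, deleted_comments + 1)],
    )
    # One statement that soft-deletes every comment, like a user deletion would.
    conn.execute("UPDATE comment SET deleted = 1")
    (total,) = conn.execute("SELECT SUM(writes) FROM site_aggregates").fetchone()
    conn.close()
    return total


unscoped = row_writes(scoped=False)  # every site row, once per comment
scoped = row_writes(scoped=True)     # one site row, once per comment
print(unscoped, scoped)
```

With the defaults, the unscoped trigger performs 50 x 20 row writes while the scoped one performs 20, which is the same kind of multiplication described above, just at toy scale.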
This comment was marked as abuse.
I think you copied and pasted the triggers without the extra where?
This comment was marked as abuse.
I used pg_dump --schema-only to get my baseline rather than try to make sure I had the latest migration files. I think your file name suggestion is a reasonable assumption. Thanks for running further with this!
I just had a realization. If I patch lemmy.ca and one user deletes themselves, is that going to federate out and crash every lemmy instance?
This comment was marked as abuse.
When a user deletes their account, all of their comments are edited and then flagged as deleted, so I would assume that will run on all instances. I'm also basing this assumption on the outage multiple instances had yesterday morning. I saw it happening and killed it off on lemmy.ca right away, but multiple other sites were impacted. I didn't check incoming federation logs to confirm, but it was the same query we're dealing with here.
No, the problem is the query causes locking, so it takes out the entire instance. Unless postgres runs out of RAM or something, nothing is going to abort that transaction except a lemmy admin killing the query or restarting postgres. I left this query running for several hours just to see if it would finish. It never times out gracefully. So worst case is each time a user is deleted, all unpatched instances have their db go down.
This comment was marked as abuse.
As a quick fix for the admins, would setting
This comment was marked as abuse.
It's in #3649. It's not a big query, just the update.
Read the conditional function used in the trigger. If the deleted column changes from false to true, it's classified as a comment delete. It's not about actually doing a SQL DELETE.
This comment was marked as abuse.
Maybe, but that seems like a more complicated change and kind of a separate concern. Right now an issue is that for every post / comment insert / delete, it loops through all other posts of the same community and user too. So if you delete all comments, that's O(n^2) of effort because it looks at every comment for every comment. By improving the triggers, that would be O(n) effort. Even if you schedule the delete, you'll still not get better than O(n), just with a better constant factor and in the background. I think DB load issues and latency will already disappear by just getting rid of the O(n^2).
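As a rough illustration of the O(n^2) versus O(n) point, here is a small Python model. It is not Lemmy code: both function names are invented for this sketch, and the "trigger" is reduced to either rescanning all comments after each soft-delete or just decrementing a running counter.

```python
def rows_scanned_with_recount(n: int) -> int:
    """Soft-delete n comments, recounting the live ones after every delete."""
    deleted = [False] * n
    scanned = 0
    for i in range(n):
        deleted[i] = True
        # Full rescan per delete, like a trigger that recomputes the aggregate.
        live = sum(1 for flag in deleted if not flag)
        assert live == n - i - 1
        scanned += n
    return scanned  # n scans of n rows each


def counter_updates_incremental(n: int) -> int:
    """Soft-delete n comments, adjusting a running counter once per delete."""
    live = n
    updates = 0
    for _ in range(n):
        live -= 1  # constant work per delete
        updates += 1
    assert live == 0
    return updates  # one update per deleted comment


print(rows_scanned_with_recount(1590), counter_updates_incremental(1590))
```

For the roughly 1590 comments from the original report, the recount variant looks at about 2.5 million rows in this model, while the incremental variant does 1590 counter updates.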
This comment was marked as abuse.
You're also mixing a ton of off-topic things in your comment, which makes it a bit hard to respond. I don't like diesel either, but that has nothing to do with the rest of what you're saying. Site admins not understanding the lemmy code is also unrelated to anything like the bugs causing excess queries. I know it's annoying and exhausting to see so many problems, but you're not going to get more people to listen to you by writing more text and jumping topics.
It's really not though. The issue so far has always been that it's doing tons of unnecessary work, not that it's doing work synchronously. PostgreSQL can handle 1000 inserts per second even on a single table. If lemmy does a constant amount of work for every comment/post/community create/update/delete, then the performance will scale very far, at least to like 100x the current scale. If you do work in a scheduled job, you still don't reduce the amount of work for most cases, you just move it to be done some time later. That can reduce peak loads, but we're still far from the point where that's actually needed.
This comment was marked as abuse.
I'm trying to be straightforward with you, not condescending. I'm just stating my opinion that the way you're commenting on some of these GitHub issues won't get you much benefit except more frustration. I tried messaging you on matrix earlier, but you didn't respond. You're doing good work with the issue diagnosis and pull requests you made.
This comment was marked as abuse.
You're right, it's only on delete. I was looking at person_aggregates_comment_count and person_aggregates_post_count but it's only in the ELSIF was_removed_or_deleted case.
I think merging most of the triggers based on the source table is a great idea, not sure what dessalines and nutomic think though. Did you intentionally not include the INSERT case though? e.g. right now
This comment was marked as abuse.
Just saw #2910 and some of those triggers actually don't make sense because coalesce(0, X) always returns 0, so those subqueries are useless.
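A quick way to check the coalesce point: COALESCE returns its first non-NULL argument, so a literal 0 in first position always wins. The sketch below uses Python's built-in SQLite rather than PostgreSQL (the semantics of COALESCE are the same in both), and `(SELECT 42)` is just a placeholder for whatever subquery the real triggers use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def scalar(sql: str):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Literal first: the subquery result can never be returned.
literal_first = scalar("SELECT COALESCE(0, (SELECT 42))")

# Subquery first, literal as fallback: the usual intended order.
fallback_last = scalar("SELECT COALESCE((SELECT 42), 0)")

# The fallback only applies when the subquery yields NULL.
null_subquery = scalar("SELECT COALESCE((SELECT NULL), 0)")

print(literal_first, fallback_last, null_subquery)
```

So if the intent was "use the subquery, or 0 when it is NULL", the arguments would need to be in the opposite order.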
This comment was marked as abuse.
I don't suspect sabotage. The founders have admitted not being SQL experts. And many issues only appeared once the user count got up there.
This comment was marked as abuse.
This should be fixed in 0.18.3. Please comment or open a new issue if it's still an issue.
Requirements
Summary
Deleting a user with 1,59k comments who moderates ~20 communities, some of which have a large number (~2000) of subscribers, makes PostgreSQL go brrrrt and you get an error
when you visit the site.
After the request/query has been killed, the user is still there as if nothing happened.
Steps to Reproduce
Technical Details
I am not the admin of this instance, so I'll ask them to provide the logs.
Version
BE 0.17.4
Lemmy Instance URL
feddit.de