[Bug]: Deleting a user with lots of comments (~1.59k) modding a few large communities (~20) kills the backend #3165

Closed
3 of 4 tasks
heylix opened this issue Jun 17, 2023 · 76 comments

@heylix

heylix commented Jun 17, 2023

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Is this a UI / front end issue? Use the lemmy-ui repo.

Summary

Deleting a user with ~1.59k comments who moderates ~20 communities, some of which have a huge number of subscribers (~2,000), makes PostgreSQL go brrrrt and you get the error

404: FetchError: invalid json response body at http://lemmy:8536/api/v3/site

when you visit the site.

After the request/query has been killed, the user is still there as if nothing happened.

Steps to Reproduce

  1. create a user
  2. create 20 communities
  3. have 6,000 users subscribe to these communities
  4. have the user create ~1.59k comments
  5. let the users create a few thousand posts
  6. delete the user created in step 1

Technical Details

I am not the admin of this instance, so I'll ask them to provide the logs.

Version

BE 0.17.4

Lemmy Instance URL

feddit.de

@heylix added the "bug (Something isn't working)" label on Jun 17, 2023
@RocketDerp

This comment was marked as abuse.

@Sh4d

Sh4d commented Jul 21, 2023

The query I mentioned in #3649 seems to have just been replicated out and taken out a significant number of instances, including lemmy.world

@Sh4d

Sh4d commented Jul 22, 2023

Ok, I'm frustratingly close to figuring this out but I'm just completely stumped.

First off, two triggers seem to be to blame: site_aggregates_comment_delete and person_aggregates_comment_count

I'm focusing on the first one still and trying to figure it out, so I'm just dropping the other one for the moment.

Clone a copy of your database to play around with, and do this:

DROP TRIGGER IF EXISTS person_aggregates_comment_count ON public.comment;

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = true;
SET auto_explain.log_nested_statements = true;

CREATE OR REPLACE FUNCTION public.site_aggregates_comment_delete()
    RETURNS trigger
    LANGUAGE plpgsql
    COST 100
    VOLATILE NOT LEAKPROOF
AS $BODY$
begin
    IF (was_removed_or_deleted(TG_OP, OLD, NEW)) THEN
        update site_aggregates sa
        set comments = comments - 1;
        RAISE NOTICE '--------';
        RAISE NOTICE 'OLD %', OLD;  -- RAISE NOTICE uses a bare % placeholder
        RAISE NOTICE 'NEW %', NEW;
        RAISE NOTICE '--------';
    END IF;
    RETURN null;
end
$BODY$;

You're now in a state where any statement that fires this trigger will dump some debug info to your psql session, and you'll also get a copy of every query that runs in your postgres docker log.

Now run something like this in a second terminal:

rm log; docker logs -f db-postgres-1 --tail=0 2>&1 | tee log

Then run this in your first window:

UPDATE "comment" SET "content" = 'test', "deleted" = 't', "updated" = now() WHERE ("comment"."creator_id" = 379100) and id = 246642;

You'll get this in your debug output:

2023-07-22 21:17:19.882 UTC [34] LOG:  duration: 252.393 ms  plan:
        Query Text: update community_aggregates ca
        set comments = comments - 1 from comment c, post p
        where p.id = c.post_id
          and p.id = OLD.post_id
          and ca.community_id = p.community_id
        Update on community_aggregates ca  (cost=1.13..40.62 rows=0 width=0) (actual time=252.367..252.385 rows=0 loops=1)
          ->  Nested Loop  (cost=1.13..40.62 rows=34 width=26) (actual time=59.884..214.628 rows=359 loops=1)
                ->  Nested Loop  (cost=0.70..16.74 rows=1 width=24) (actual time=59.558..59.574 rows=1 loops=1)
                      ->  Index Scan using post_pkey on post p  (cost=0.42..8.44 rows=1 width=14) (actual time=44.013..44.018 rows=1 loops=1)
                            Index Cond: (id = 619508)
                      ->  Index Scan using community_aggregates_community_id_key on community_aggregates ca  (cost=0.28..8.30 rows=1 width=18) (actual time=15.532..15.534 rows=1 loops=1)
                            Index Cond: (community_id = p.community_id)
                ->  Index Scan using idx_comment_post on comment c  (cost=0.43..23.45 rows=34 width=10) (actual time=0.306..154.030 rows=359 loops=1)
                      Index Cond: (post_id = 619508)
2023-07-22 21:17:19.882 UTC [34] CONTEXT:  SQL statement "update community_aggregates ca
        set comments = comments - 1 from comment c, post p
        where p.id = c.post_id
          and p.id = OLD.post_id
          and ca.community_id = p.community_id"
        PL/pgSQL function community_aggregates_comment_count() line 10 at SQL statement
2023-07-22 21:17:20.300 UTC [34] LOG:  duration: 163.430 ms  plan:
        Query Text: update post_aggregates pa
            set comments = comments - 1
            where pa.post_id = NEW.post_id
        Update on post_aggregates pa  (cost=0.42..8.44 rows=0 width=0) (actual time=163.416..163.421 rows=0 loops=1)
          ->  Index Scan using post_aggregates_post_id_key on post_aggregates pa  (cost=0.42..8.44 rows=1 width=14) (actual time=35.719..35.725 rows=1 loops=1)
                Index Cond: (post_id = 619508)
2023-07-22 21:17:20.300 UTC [34] CONTEXT:  SQL statement "update post_aggregates pa
            set comments = comments - 1
            where pa.post_id = NEW.post_id"
        PL/pgSQL function post_aggregates_comment_deleted() line 4 at SQL statement
2023-07-22 21:17:20.369 UTC [34] LOG:  duration: 58.512 ms  plan:
        Query Text: update site_aggregates sa
                set comments = comments - 1
        Update on site_aggregates sa  (cost=0.00..41.94 rows=0 width=0) (actual time=58.500..58.505 rows=0 loops=1)
          ->  Seq Scan on site_aggregates sa  (cost=0.00..41.94 rows=1675 width=14) (actual time=15.322..19.954 rows=1675 loops=1)
2023-07-22 21:17:20.369 UTC [34] CONTEXT:  SQL statement "update site_aggregates sa
                set comments = comments - 1"
        PL/pgSQL function site_aggregates_comment_delete() line 4 at SQL statement
2023-07-22 21:17:20.370 UTC [34] LOG:  duration: 1324.522 ms  plan:
        Query Text: UPDATE "comment" SET "content" = 'test', "deleted" = 't', "updated" = now() WHERE ("comment"."creator_id" = 379100) and id = 246642;
        Update on comment  (cost=0.43..8.45 rows=0 width=0) (actual time=392.736..392.742 rows=0 loops=1)
          ->  Index Scan using comment_pkey on comment  (cost=0.43..8.45 rows=1 width=47) (actual time=56.523..56.530 rows=1 loops=1)
                Index Cond: (id = 246642)
                Filter: (creator_id = 379100)

Looks reasonable, right? I'm lazy and just summarized how many times each function ran:

# grep 'pgSQL function' log | awk '{print $3}' | sort | uniq -c | sort -n
      1 community_aggregates_comment_count()
      1 post_aggregates_comment_deleted()
      1 site_aggregates_comment_delete()

Cool beans. Looks good.

Now try running it against two rows:

lemmy=# UPDATE "comment" SET "content" = 'test', "deleted" = 't', "updated" = now() WHERE ("comment"."creator_id" = 379100) and id in (265652, 279447);
NOTICE:  --------
NOTICE:  OLD (265652,379100,641532,"Holy holy hell ",f,"2023-06-13 17:35:36.13016",,f,https://lemmy.ca/comment/265652,t,0.261818.265652,f,0)
NOTICE:  NEW (265652,379100,641532,test,f,"2023-06-13 17:35:36.13016","2023-07-22 21:19:17.611686",t,https://lemmy.ca/comment/265652,t,0.261818.265652,f,0)
NOTICE:  --------
NOTICE:  --------
NOTICE:  OLD (279447,379100,635897,"EDF! EDF! EDF!",f,"2023-06-14 03:01:00.002662",,f,https://lemmy.ca/comment/279447,t,0.256773.279447,f,0)
NOTICE:  NEW (279447,379100,635897,test,f,"2023-06-14 03:01:00.002662","2023-07-22 21:19:17.611686",t,https://lemmy.ca/comment/279447,t,0.256773.279447,f,0)
NOTICE:  --------
UPDATE 2

Cool, still looks good.

Hop over to the postgres log to see what the queries look like:

# grep 'pgSQL function' log | awk '{print $3}' | sort | uniq -c | sort -n
      2 community_aggregates_comment_count()
      2 post_aggregates_comment_deleted()
   1677 site_aggregates_comment_delete()

What. The. Fuck.

Somehow once you update more than 1 row in comment, postgres goes nuts running the update query inside the trigger.

What I don't understand is that the RAISE lines in the trigger aren't firing multiple times, so what's causing just the update site_aggregates query to run 1,677 times?

There are 1,675 rows in site_aggregates, plus the 2 rows I updated, and I think that's how postgres gets to 1,677 executions. What's eluding me is what's causing those updates to happen. I can't find any strange triggers or anything else that would run this.

Anyone got any clever ideas? I've pored over the postgres trigger docs and nothing is standing out.
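
One way to rule out hidden triggers, for anyone following along, is to dump every user-defined trigger straight from the catalogs (standard PostgreSQL, nothing Lemmy-specific):

-- List every user-defined trigger with its table and full definition.
SELECT tgrelid::regclass AS table_name,
       tgname,
       pg_get_triggerdef(oid) AS definition
FROM pg_trigger
WHERE NOT tgisinternal
ORDER BY tgrelid::regclass::text, tgname;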

@Sh4d

Sh4d commented Jul 22, 2023

The second bug was an easier fix:

CREATE OR REPLACE FUNCTION public.person_aggregates_comment_count() RETURNS trigger
    LANGUAGE plpgsql
    AS $$
begin
    IF (was_restored_or_created(TG_OP, OLD, NEW)) THEN
        update person_aggregates
        set comment_count = comment_count + 1 where person_id = NEW.creator_id;
    ELSIF (was_removed_or_deleted(TG_OP, OLD, NEW)) THEN
        update person_aggregates
        set comment_count = comment_count - 1 where person_id = OLD.creator_id;

        -- If the comment gets deleted, the score calculation trigger won't fire,
        -- so you need to re-calculate
        update person_aggregates ua
        set comment_score = cd.score
        from (
                 select u.id,
                        coalesce(0, sum(cl.score)) as score
                        -- User join because comments could be empty
                 from person u
                          left join comment c on u.id = c.creator_id and c.deleted = 'f' and c.removed = 'f'
                          left join comment_like cl on c.id = cl.comment_id
                 where u.id = OLD.creator_id
                 group by u.id
             ) cd
        where ua.person_id = OLD.creator_id;
    END IF;
    return null;
end $$;

I've added the where u.id = OLD.creator_id in the middle there so that we're not pulling back the entire user/comment association set just to update one user.

If you delete the last trigger I mentioned and then do this update, the UPDATE "comment" SET "content" = 'test', "deleted" = 't', "updated" = now() WHERE ("comment"."creator_id" = 379100); query runs reasonably quickly.

@RocketDerp

Two consecutive comments were marked as abuse.

@Sh4d

Sh4d commented Jul 22, 2023

> this function site_aggregates_comment_delete lacks a WHERE clause; if that is running on every row on every single delete, that would be a major performance bug impacting routine deletes...

Yeah I played around with the actual function too and found something interesting. If you hard code a single "where id = 1234" inside the trigger, this issue goes away.

You can simplify that query down to update site_aggregates set comments = comments - 1 and still reproduce the bug; it doesn't seem related to the weird join in there. Doing update site_aggregates set comments = comments - 1 where id = 1737 fixes the issue, though.

Also, I think that query might need a condition on the site id. It seems odd that deleting a comment decreases the aggregated comment count for every instance. Maybe it's meant to be a global counter of some sort; I didn't dig into that yet.
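
For illustration, a version scoped to one site might look like this (just a sketch; the local_site lookup is an assumption about how you'd identify the local instance in Lemmy's schema):

-- Sketch: decrement only the local instance's aggregate row instead of
-- all ~1,675 rows. Assumes local_site.site_id points at the local site.
update site_aggregates sa
set comments = comments - 1
where sa.site_id = (select site_id from local_site);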

> Although I honestly don't have a clue how these functions originate. How are the Rust developers initiating the creation of these PostgreSQL functions? "site_aggregates_comment_delete" only shows up in migration SQL files.

LOL, same. I figured I'd just work out the SQL first. Nice to know I'm not going crazy and this isn't something obvious.

@Djones4822
Contributor

Djones4822 commented Jul 22, 2023

site_aggregates stores the aggregates of every federated site; we only need to be updating the local site's aggregate.

@RocketDerp

Two consecutive comments were marked as abuse.

@Sh4d

Sh4d commented Jul 23, 2023

I think you copied and pasted the triggers without the extra where?

@RocketDerp

Several consecutive comments were marked as abuse.

@Sh4d

Sh4d commented Jul 23, 2023

I used pg_dump --schema-only to get my baseline rather than trying to make sure I had the latest migration files.

I think your file name suggestion is a reasonable assumption. Thanks for running further with this!

@Sh4d

Sh4d commented Jul 23, 2023

I just had a realization. If I patch lemmy.ca and one user deletes themselves, is that going to federate out and crash every lemmy instance?

@RocketDerp

This comment was marked as abuse.

@Sh4d

Sh4d commented Jul 23, 2023

When a user deletes their account, all their comments are edited and then flagged as deleted, so I would assume that update will run on all instances.

I'm also basing this assumption on the outage multiple instances had yesterday morning. I saw it happening and killed it off on lemmy.ca right away but multiple other sites were impacted. I didn't check incoming federation logs to confirm, but it was the same query we're dealing with here.
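
For context, the bulk edit a user deletion performs is essentially the statement we've been testing with: one UPDATE rewrites and flags every comment by the user, so the row-level triggers fire once per comment. A sketch based on the queries in this thread (the real deletion presumably substitutes a deletion placeholder for the content rather than 'test'):

-- One statement touches every comment by the deleted account,
-- firing the per-row triggers once per comment.
UPDATE "comment"
SET "content" = 'test', "deleted" = 't', "updated" = now()
WHERE "comment"."creator_id" = 379100;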

@Sh4d

Sh4d commented Jul 23, 2023

> I suppose a worst-case situation is that a server tries to federate it, the transaction fails on the receiving server, and then they go into retry battles...

No, the problem is that the query causes locking, so it takes out the entire instance. Unless postgres runs out of RAM or something, nothing is going to abort that transaction except a lemmy admin killing the query or restarting postgres.

I left this query running for several hours just to see if it would finish. It never times out gracefully.

So the worst case is that each time a user is deleted, every unpatched instance has its db go down.
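
For admins caught mid-incident, the standard catalogs are enough to spot and kill the runaway statement (pg_stat_activity and pg_terminate_backend are stock PostgreSQL; the pid below is a placeholder):

-- Find the longest-running active statements.
SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;

-- Kill the offending backend using a pid from the query above.
SELECT pg_terminate_backend(12345);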

@RocketDerp

Two consecutive comments were marked as abuse.

@alexandrupero
Contributor

As a quick fix for admins, would setting statement_timeout and idle_in_transaction_session_timeout at the db/user level in postgres to some sane values not work, to make sure the queries get aborted rather than running indefinitely?
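
Presumably something like this would do it (a sketch; the role name lemmy and the values are assumptions, but both settings are standard PostgreSQL):

-- Cap all sessions for the application role. Only new sessions pick
-- these up; existing connections have to reconnect.
ALTER ROLE lemmy SET statement_timeout = '30s';
ALTER ROLE lemmy SET idle_in_transaction_session_timeout = '60s';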

@RocketDerp

This comment was marked as abuse.

@Sh4d

Sh4d commented Jul 23, 2023

It's in #3649. It's not a big query, just the update.

@Sh4d

Sh4d commented Jul 23, 2023

Read the conditional function used in the trigger. If the deleted column changes from false to true, it's classified as a comment delete. It's not about actually doing an SQL DELETE.
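
For reference, the conditional is along these lines; this is a sketch of what's in Lemmy's migrations rather than a verified copy, so treat the details as approximate:

create or replace function was_removed_or_deleted(TG_OP text, OLD record, NEW record)
returns boolean
language plpgsql
as $$
begin
    IF (TG_OP = 'INSERT') THEN
        return false;
    END IF;
    IF (TG_OP = 'DELETE') THEN
        return true;
    END IF;
    -- an UPDATE counts as a "delete" when either flag flips from false to true
    return TG_OP = 'UPDATE' AND (
        (OLD.deleted = 'f' AND NEW.deleted = 't') OR
        (OLD.removed = 'f' AND NEW.removed = 't')
    );
end $$;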

@RocketDerp

This comment was marked as abuse.

@phiresky
Collaborator

phiresky commented Jul 25, 2023

Maybe, but that seems like a more complicated change and kind of a separate concern. Right now an issue is that for every post/comment insert/delete, it also loops through all the other posts of the same community and user. So if you delete all comments, that's O(n^2) effort, because it looks at every comment for every comment. By improving the triggers, that becomes O(n). Even if you schedule the delete, you still won't do better than O(n), just with a better constant factor and in the background. I think the DB load and latency issues will already disappear by just getting rid of the O(n^2).
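
To make the two shapes concrete (the second mirrors the recompute-from-scratch subquery quoted earlier; table and column names follow Lemmy's schema):

-- Constant work per deleted comment: decrement one aggregate row.
update person_aggregates
set comment_count = comment_count - 1
where person_id = OLD.creator_id;

-- O(n) work per deleted comment: rescan every comment the user ever made
-- to recompute their score. Fired once per comment, that's O(n^2) overall.
update person_aggregates ua
set comment_score = (
    select coalesce(sum(cl.score), 0)
    from comment c
    left join comment_like cl on cl.comment_id = c.id
    where c.creator_id = OLD.creator_id
)
where ua.person_id = OLD.creator_id;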

@RocketDerp

This comment was marked as abuse.

@phiresky
Collaborator

You're also mixing a ton of off-topic things in your comment, which makes it a bit hard to respond. I don't like diesel either, but that has nothing to do with the rest of what you're saying. Site admins not understanding the lemmy code is also unrelated to anything like the bugs causing excess queries. I know it's annoying and exhausting to see so many problems, but you're not going to get more people to listen to you by writing more text and jumping topics.

> The whole attempt to provide real-time counting of content from multiple incoming peer servers is ambitious and falling on its face

It's really not though. The issue so far has always been that it's doing tons of unnecessary work, not that it's doing work synchronously. PostgreSQL can handle 1000 inserts per second even on a single table. If lemmy does a constant amount of work for every comment/post/community create/update/delete, then the performance will scale very far, at least to like 100x the current scale.

If you do work in a scheduled job, you still don't reduce the amount of work for most cases, you just move it to be done some time later. That can reduce peak loads, but we're still far from the point where that's actually needed.

@RocketDerp

This comment was marked as abuse.

@phiresky
Collaborator

phiresky commented Jul 25, 2023

I'm trying to be straightforward with you, not condescending. I'm just stating my opinion that the way you're commenting on some of these GitHub issues won't get you much benefit except more frustration. I tried messaging you on matrix earlier, but you didn't respond.

You're doing good work with the issue diagnosis and pull requests you made.

@RocketDerp

Several consecutive comments were marked as abuse.

@phiresky
Collaborator

> Can you show me in the code where it does this for INSERT

You're right, it's only on delete. I was looking at person_aggregates_comment_count and person_aggregates_post_count but it's only in the ELSIF was_removed_or_deleted case.

> trying to solve the issue of so many triggers being independent

I think merging most of the triggers based on the source table is a great idea; not sure what dessalines and nutomic think, though. Did you intentionally not include the INSERT case? E.g. right now person_aggregates_post_count triggers after insert or delete or update of removed, deleted on post, and I think those parts could be included there as well. You can still use the TG_OP = XXX filter or the existing was_restored_or_created() and was_removed_or_deleted() functions; a sketch follows below.
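
A sketch of the merged declaration being described (the trigger and function names mirror the existing ones; treat the exact spelling as an assumption):

-- One row-level trigger covering inserts, deletes, and flag flips; the
-- function decides the direction via was_restored_or_created() /
-- was_removed_or_deleted().
CREATE TRIGGER person_aggregates_post_count
AFTER INSERT OR DELETE OR UPDATE OF removed, deleted ON post
FOR EACH ROW
EXECUTE FUNCTION person_aggregates_post_count();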

@RocketDerp

Several consecutive comments were marked as abuse.

@phiresky
Collaborator

Just saw #2910, and some of those triggers actually don't make sense: coalesce(0, X) always returns 0, so those subqueries are useless.
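
coalesce returns its first non-NULL argument, so the fallback has to come last; the fix is presumably just swapping the arguments:

-- coalesce(0, sum(...)) is always 0; the aggregate is never consulted.
SELECT coalesce(0, 42);                                     -- 0 (the bug's shape)
SELECT coalesce(sum(v), 0) FROM (VALUES (1), (2)) AS t(v);  -- 3 (intended shape)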

@RocketDerp

This comment was marked as abuse.

@WayneSheppard

I don't suspect sabotage. The founders have admitted not being SQL experts. And many issues only appeared once the user count got up there.

@RocketDerp

Two consecutive comments were marked as abuse.

@phiresky
Collaborator

This should be fixed in 0.18.3. Please comment or open a new issue if the problem persists.
