Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Denormalize community_id into post_aggregates for a 1000x speed-up when loading posts #3653

Merged
merged 6 commits into from Jul 20, 2023

Conversation

sunaurus
Copy link
Collaborator

Credit to @phiresky for this idea, originally posted in comments of #2994

This PR adds community_id to post_aggregates (& a new index on post_aggregates) to enable joining community directly to post_aggregates when querying posts.

On lemm.ee, this optimization speeds up the query for front page of subscribed posts ~1000x, from several seconds to to just milliseconds. You can check a before/after of query plans here: https://gist.github.com/sunaurus/856e03165bb0c0010505afeebde45230

Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at these join filters:

Screenshot_2023-07-18-09-42-04-582_mark via gp

This hints that the same process could be done with post.creator_id -> post_aggregates.creator_id .

I also think this makes the case that either:

  • post_aggregates should be the "main" table that the others are joined to, and primarily filter from.
  • Or that everywhere a post.XXX column is in the filters or joins, it should be replaced with post_aggregates.XXX

I also don't know what new indexes should be added, but lets make sure they're actually getting used.

@sunaurus
Copy link
Collaborator Author

sunaurus commented Jul 18, 2023

At this point, the difference is very small, but indeed it is a bit faster to treat post_aggregates as the main table completely. Will push additional commits soon.

@phiresky
Copy link
Collaborator

phiresky commented Jul 18, 2023

post_aggregates should be the "main" table that the others are joined to, and primarily filter from.

Is there a specific reason the post_aggregates and post table are spilt? In general I think it would make sense to have everything in the same table. (same for comments /person...). 1:1 relations aren't that great in postgresql in general imo, tables with a huge amount of columns work pretty well. It would increase contention a bit because every change in votes would lock the post row for a bit (just writes, not reads) but I don't think that justifies the split. Or is there other reasons for the spilt?

@sunaurus
Copy link
Collaborator Author

New commits contain the following changes

  1. Remove the unused index
  2. Added creator_id to post_aggregates
  3. Made post_aggregates the "main" table for PostQuery

@sunaurus
Copy link
Collaborator Author

sunaurus commented Jul 18, 2023

  1. Made post_aggregates the main table for PostView as well now (this was already very fast before, but indeed it is even faster now with these changes)

@sunaurus sunaurus force-pushed the post_aggregates_community_id branch 2 times, most recently from e71461e to 565df53 Compare July 19, 2023 13:34
crates/db_views/src/post_view.rs Show resolved Hide resolved

ALTER TABLE post_aggregates
ALTER COLUMN community_id SET NOT NULL,
ALTER COLUMN creator_id SET NOT NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this one, much like the post table has indexes on community_id and creator_id, it would probably be a good idea to add these two indexes to post_aggregates now too:

IE:

create index idx_post_aggregates_community on post_aggregates (community_id);
create index idx_post_aggregates_creator on post_aggregates (creator_id);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added these indexes as well

Copy link
Collaborator Author

@sunaurus sunaurus Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dessalines I just discovered that adding these two indexes messes up the query plan again 😃

Edit: hmm, I just got some 3-second query plans with the indexes. Then I dropped the indexes - back to a few ms. Then re-added, and it's still at a few ms. So I can't reproduce the problem at the moment, but I will investigate a bit more..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the query plans are definitely consistently worse with those indexes.
With indexes: https://explain.depesz.com/s/W7cF
Without indexes: https://explain.depesz.com/s/4qds

Copy link
Collaborator Author

@sunaurus sunaurus Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted these indexes for now, let me know if you disagree with that

Copy link
Collaborator

@phiresky phiresky Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, you didn't add any index to get the speedup? the "right" index should be post_aggregates(community_id, featured_local desc, hot_rank desc)

it should use that one when the user has only subscribed to a few communities, and the idx_featured_local_hot_rank when subscribed to many communities

Copy link
Collaborator Author

@sunaurus sunaurus Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed a bit on Matrix already, but commenting here as well for visibility:


the "right" index should be post_aggregates(community_id, featured_local desc, hot_rank desc)

This index also messes up the query plan. It indeed makes the queries super fast for users with few subscriptions, but it makes them slow again for others.

Without that index
User with 86 subscriptions: 3ms query execution
User with 0 subscriptions: 700 ms query execution
(I didn't measure exact time, but it's also very fast for logged out users without the index)

With that index
User with 86 subscrptions: 2-3 second query execution
User with 0 subscriptions: 1ms query execution

phiresky on Matrix said:

weird. do the query plans have larger misestimates?
if so try setting default statistics target to 1000 and run analyze again. if not then I don't know why PG is being dumb


I will see tomorrow if playing around with statistics will help. I know there's also CREATE STATISTICS in postgres, but I've never used it so would need to investigate it a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit and sort is the main culprit, but also its weird to me that postgres sometimes chooses a slower index.

We can tweak those indexes later, I'll try to tag the issues as DB

@sunaurus sunaurus requested a review from dessalines July 19, 2023 15:05
@sunaurus sunaurus force-pushed the post_aggregates_community_id branch from ed91fc0 to f58d660 Compare July 19, 2023 15:07
Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thx!

I'll let @Nutomic or @phiresky look over before merging.

@sunaurus sunaurus force-pushed the post_aggregates_community_id branch from f58d660 to 021d3a6 Compare July 19, 2023 18:14
@dessalines dessalines merged commit b511c2e into LemmyNet:main Jul 20, 2023
1 check passed
@sunaurus sunaurus deleted the post_aggregates_community_id branch July 20, 2023 21:04
Nutomic pushed a commit that referenced this pull request Jul 21, 2023
…en loading posts (#3653)

* Denormalize community_id into post_aggregates for a 1000x speed-up when loading posts

* Remove unused index

* Add creator_id to post_aggregates

* Use post_aggregates as main table for PostQuery

* Make post_aggregates the main table for PostView

* Reformat SQL
@RocketDerp

This comment was marked as abuse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants