New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Denormalize community_id into post_aggregates for a 1000x speed-up when loading posts #3653
Denormalize community_id into post_aggregates for a 1000x speed-up when loading posts #3653
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at these join filters:
This hints that the same process could be done with post.creator_id
-> post_aggregates.creator_id
.
I also think this makes the case that either:
post_aggregates
should be the "main" table that the others are joined to, and primarily filter from.- Or that everywhere a
post.XXX
column is in the filters or joins, it should be replaced withpost_aggregates.XXX
I also don't know what new indexes should be added, but lets make sure they're actually getting used.
migrations/2023-07-18-082614_post_aggregates_community_id/up.sql
Outdated
Show resolved
Hide resolved
At this point, the difference is very small, but indeed it is a bit faster to treat |
Is there a specific reason the post_aggregates and post table are spilt? In general I think it would make sense to have everything in the same table. (same for comments /person...). 1:1 relations aren't that great in postgresql in general imo, tables with a huge amount of columns work pretty well. It would increase contention a bit because every change in votes would lock the post row for a bit (just writes, not reads) but I don't think that justifies the split. Or is there other reasons for the spilt? |
New commits contain the following changes
|
|
e71461e
to
565df53
Compare
|
||
ALTER TABLE post_aggregates | ||
ALTER COLUMN community_id SET NOT NULL, | ||
ALTER COLUMN creator_id SET NOT NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this one, much like the post table has indexes on community_id and creator_id, it would probably be a good idea to add these two indexes to post_aggregates
now too:
IE:
create index idx_post_aggregates_community on post_aggregates (community_id);
create index idx_post_aggregates_creator on post_aggregates (creator_id);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, added these indexes as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dessalines I just discovered that adding these two indexes messes up the query plan again 😃
Edit: hmm, I just got some 3-second query plans with the indexes. Then I dropped the indexes - back to a few ms. Then re-added, and it's still at a few ms. So I can't reproduce the problem at the moment, but I will investigate a bit more..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, the query plans are definitely consistently worse with those indexes.
With indexes: https://explain.depesz.com/s/W7cF
Without indexes: https://explain.depesz.com/s/4qds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted these indexes for now, let me know if you disagree with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, you didn't add any index to get the speedup? the "right" index should be post_aggregates(community_id, featured_local desc, hot_rank desc)
it should use that one when the user has only subscribed to a few communities, and the idx_featured_local_hot_rank when subscribed to many communities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed a bit on Matrix already, but commenting here as well for visibility:
the "right" index should be post_aggregates(community_id, featured_local desc, hot_rank desc)
This index also messes up the query plan. It indeed makes the queries super fast for users with few subscriptions, but it makes them slow again for others.
Without that index
User with 86 subscriptions: 3ms query execution
User with 0 subscriptions: 700 ms query execution
(I didn't measure exact time, but it's also very fast for logged out users without the index)
With that index
User with 86 subscrptions: 2-3 second query execution
User with 0 subscriptions: 1ms query execution
phiresky on Matrix said:
weird. do the query plans have larger misestimates?
if so try setting default statistics target to 1000 and run analyze again. if not then I don't know why PG is being dumb
I will see tomorrow if playing around with statistics will help. I know there's also CREATE STATISTICS
in postgres, but I've never used it so would need to investigate it a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The limit and sort is the main culprit, but also its weird to me that postgres sometimes chooses a slower index.
We can tweak those indexes later, I'll try to tag the issues as DB
ed91fc0
to
f58d660
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f58d660
to
021d3a6
Compare
…en loading posts (#3653) * Denormalize community_id into post_aggregates for a 1000x speed-up when loading posts * Remove unused index * Add creator_id to post_aggregates * Use post_aggregates as main table for PostQuery * Make post_aggregates the main table for PostView * Reformat SQL
Credit to @phiresky for this idea, originally posted in comments of #2994
This PR adds
community_id
topost_aggregates
(& a new index onpost_aggregates
) to enable joiningcommunity
directly topost_aggregates
when querying posts.On lemm.ee, this optimization speeds up the query for front page of subscribed posts ~1000x, from several seconds to to just milliseconds. You can check a before/after of query plans here: https://gist.github.com/sunaurus/856e03165bb0c0010505afeebde45230