Optimize PostView queries by doing limit and order *before* other joins #2994

Closed · dessalines opened this issue Jun 9, 2023 · 4 comments
Labels: area: database, enhancement

@dessalines (Member)

From here

I had another look, @dessalines. It's definitely better, and the sorting is not the limiting factor anymore, even though it seems that way to you (since taking the ordering out makes the query faster).

Looking at the query plan, it's all the different joins that add up to a slow time: the query returns 35,507 rows, and joining all of those rows to the various tables is a lot of work.

But why is it so fast when you take out the ORDER BY clause? Because you have LIMIT 40 in there, and if you don't care about the order, PostgreSQL can grab the first 40 posts it finds (in any order) and only perform the remaining joins for those 40 posts.

But as soon as you add a sort condition (any sort order, as you found out, not just hot_rank), it needs to produce all 35,507 rows with the joined values and can only sort at the very end, because the LIMIT 40 can only be applied AFTER sorting.

So my suggestion would be to limit the number of rows as early as possible, and join to the other tables as late as possible.

For example, in shortened pseudocode, see if you can do this instead:

SELECT ...
...
...
FROM (
  SELECT * FROM post
  INNER JOIN post_aggregates ON post_aggregates.post_id = post.id
  WHERE post.featured_local = true AND ...
  ORDER BY post_aggregates.hot_rank DESC
  LIMIT 40
) AS top_posts
LEFT JOIN ...
LEFT JOIN all the other tables

If you can rewrite it so that post and post_aggregates are joined and sorted first, you can grab the top 40 there and only do all the other joins afterwards, which means joining 40 rows instead of 35,507.

It might not be quite that simple: you might have to join to the ban lists and similar tables first as well, to make sure you get the correct 40. But the idea is to join only to what you need to filter, then limit, and then join to the rest of the info.

If you need to join to too many other tables first, then we'll have to think of other ideas, like batch-updating all the filter conditions onto post_aggregates (like we did with hot_rank).
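
A sketch of what that denormalization could look like (the column here is hypothetical, purely to illustrate the technique; the same backfill pattern appears in phiresky's comment below):

alter table post_aggregates add column featured_local boolean;
update post_aggregates pa set featured_local = p.featured_local
  from post p where p.id = pa.post_id;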

EDIT: similarly to the hot_rank updates, you might also try limiting the posts to only the last week. Since hot_rank is a decaying function, is anything older than a week really "hot"? That might cut the 35,507 records down a lot. Or, since we've indexed it, add WHERE hot_rank >= 1 or whatever works.
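
To sketch how those two cutoffs could combine in the inner query (assuming post_aggregates carries a published timestamp; exact column names may differ from the real schema):

SELECT * FROM post
INNER JOIN post_aggregates ON post_aggregates.post_id = post.id
WHERE post_aggregates.hot_rank >= 1
  AND post_aggregates.published > now() - interval '1 week'
ORDER BY post_aggregates.hot_rank DESC
LIMIT 40;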

Context #2877

cc @Nutomic @johanndt

@dessalines added the enhancement and area: database labels on Jun 9, 2023
@skyraptor777

There may be another way to resolve this: we could estimate beforehand what the top 50 rows should be, based on statistics from a sample, and then do the joins on those 50 to narrow them down to 40 (50 gives some leeway). It's something I had to do for a client once, and it might work in this case. I can give it a shot if you want.
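
For illustration, a rough sketch of that sampling idea in plain PostgreSQL (the percentile and sample size are placeholders that would need tuning against real data):

-- Estimate a hot_rank cutoff from a 1% sample, then only rank rows above it.
WITH cutoff AS (
  SELECT percentile_disc(0.999) WITHIN GROUP (ORDER BY hot_rank) AS min_rank
  FROM post_aggregates TABLESAMPLE SYSTEM (1)
)
SELECT pa.post_id
FROM post_aggregates pa, cutoff c
WHERE pa.hot_rank >= c.min_rank
ORDER BY pa.hot_rank DESC
LIMIT 50;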

@dessalines (Member, Author)

I tried and failed to implement the above solution, although it is technically possible in Diesel using sql_query. I've added two other PRs (referenced above) that should mitigate a lot of the PostView performance issues, though.

To outline some of the difficulties:

  • Diesel doesn't natively support selecting from subqueries; it's only possible via sql_query.
  • If I did that, I'd need to refactor a lot of the conditions into a SQL string rather than modifying the Diesel boxed query. It's possible, but we'd lose a lot of the type checking, and it could make the code really messy.

I'd prefer to hold off on adding the sql_query hack until we're sure that the above solutions aren't sufficient.

Nutomic pushed a commit that referenced this issue on Jul 6, 2023: "Make sure hot rank sorts for post and community filter by positive hot ranks" (#3497), with context in #2994.
@phiresky (Collaborator) commented Jul 6, 2023

Hey, I want to mention one large issue with the post hot-rank queries when filtering by community_follower (i.e. the user-specific front page): the post_aggregates.hot_rank column is in a different table than the post.community_id column. This matters because PostgreSQL cannot correlate the two directly, yet what you're looking for is precisely the hottest posts for the communities a user follows!

As an example, I've taken the most expensive (average time 350 ms) posts query from lemmy.world. Then I did this:

alter table post_aggregates add column community_id integer;
update post_aggregates set community_id=post.community_id from post where post.id = post_aggregates.post_id;

This copies the community_id over to post_aggregates. Then I added this index:

create index on post_aggregates (community_id, hot_rank desc);

And changed it to join community_follower with post_aggregates.community_id instead of post.community_id.
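
In other words, roughly this change to the join condition (column names assumed from context; the rest of the query is elided):

-- before: every candidate row needs a lookup into post to find its community
JOIN community_follower cf ON cf.community_id = post.community_id
-- after: the community is available directly on the sorted and filtered table
JOIN community_follower cf ON cf.community_id = post_aggregates.community_id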

This allows PG to get the hottest posts for each community_id.

Doing this, the query goes from 350 ms to 7 ms!

Here's the EXPLAIN ANALYZE before the change:

https://explain.depesz.com/s/YSn6

And here it is after the change:

https://explain.depesz.com/s/PEOH

OK, funnily enough, it didn't actually use the index. The query is still much faster, though, because it can remove rows more quickly (it doesn't have to look up the community_id separately 100k times), and that's despite a bad estimate on the rows found by idx_post_aggregates_featured_local_hot.

Edit: I found the reason it wasn't using the index: the index is wrong. The query filters by community_follower.community and sorts by featured_local, hot_rank. Adding an index on (community_id, featured_local desc, hot_rank desc) fixes it and makes the query work as I'd have expected:

https://explain.depesz.com/s/eMS9
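
For reference, that corrected index written out in the same form as the earlier one:

create index on post_aggregates (community_id, featured_local desc, hot_rank desc);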

@sunaurus (Collaborator)

I can confirm what @phiresky mentioned above - joining community with post_aggregates directly speeds up my front page query execution times from 3-4 seconds down to 3-4 milliseconds.
