
Change comparisons in cursor-based pagination #4287

Closed
wants to merge 14 commits

Conversation

dullbananas (Collaborator)

Currently, posts are compared with the cursor with a condition like this:

(a >= 1) AND (a != 1 OR b >= 2) AND (a != 1 OR b != 2 OR c >= 3)

This PR changes it to:

(a > 1) OR (a = 1 AND b > 2) OR (a = 1 AND b = 2 AND c > 3)

This should make better use of the index. At worst it becomes 3 very efficient index scans whose results are combined. The 3 comparison groups closely resemble the examples of index-friendly conditions in the PostgreSQL manual. Because they are combined with OR, any rows that get scanned but excluded from the final result should come only from other filters that can't use the index, such as deleted = false.
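
As a concrete sketch (using the placeholder columns a, b, and c from above; the real sort keys are columns such as featured_community, comments, and published), the new condition has the classic keyset-pagination shape:

-- Minimal sketch of the new cursor condition; the cursor values (1, 2, 3)
-- and LIMIT 10 are illustrative. Each OR branch is prefix equality plus one
-- range predicate, which a b-tree index on (a, b, c) can satisfy directly.
SELECT *
FROM post_aggregates
WHERE (a > 1)
   OR (a = 1 AND b > 2)
   OR (a = 1 AND b = 2 AND c > 3)
ORDER BY a, b, c
LIMIT 10;

PostgreSQL can answer each branch with its own index scan and combine them with BitmapOr, which is the "3 very efficient index scans" case described above.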

@dullbananas dullbananas marked this pull request as ready for review December 18, 2023 03:16
@dullbananas dullbananas marked this pull request as draft December 18, 2023 03:16
dullbananas (Collaborator, Author)

query plan still dum

phiresky (Collaborator) commented Dec 18, 2023

Thanks so much for your effort. Did you try the tuple comparison? I don't think it should need many new indexes, because almost all the sorts are purely descending (featured_local DESC, scaled_rank DESC, published DESC); only the "old sort" has an ascending column in it, I think.
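
For reference, a tuple (row-value) comparison collapses the three OR branches into a single range predicate. A minimal sketch, assuming a uniformly descending sort and made-up cursor values:

-- Row-value comparison: one predicate instead of three OR branches, so the
-- planner can use a single index scan. It only works when every sort key
-- runs in the same direction. Cursor values here are made up.
SELECT *
FROM post_aggregates
WHERE (featured_local, scaled_rank, published)
    < (true, 0.5, '2023-12-18'::timestamptz)
ORDER BY featured_local DESC, scaled_rank DESC, published DESC
LIMIT 10;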

dullbananas (Collaborator, Author)

Query plan not dum. Me dum.

I increased the number of posts, and now it uses the index for all pagination filters (except post_id which is not in the index).

I tried tuple comparison on one of the sorts, and it uses 1 index scan instead of 3. That's simpler and might be somewhat faster, but it doesn't change the number of index-scanned rows. I don't think it's worth the ugly workarounds needed for tuple comparisons with mixed sorting directions (see the sketch after the query plan below), but I could make the pagination library I'm working on apply tuple comparison automatically whenever possible.

I tried it again without this PR but still with many posts, and confirmed that it previously couldn't use the index in this way, so this PR will improve performance a lot.

->  Bitmap Heap Scan on post_aggregates  (cost=13.07..62.03 rows=16 width=106) (actual time=0.260..1.821 rows=4690 loops=1)
		Recheck Cond: (((community_id = 14) AND (featured_community < false)) OR ((community_id = 14) AND (NOT featured_community) AND (comments < '0'::bigint)) OR ((community_id = 14) AND (NOT featured_community) AND (comments = '0'::bigint)))
		Filter: ((featured_community < false) OR ((comments < '0'::bigint) AND (NOT featured_community)) OR ((post_id < 4712) AND (NOT featured_community) AND (comments = '0'::bigint)))
		Rows Removed by Filter: 1310
		Heap Blocks: exact=117
		->  BitmapOr  (cost=13.07..13.07 rows=18 width=0) (actual time=0.247..0.248 rows=0 loops=1)
			->  Bitmap Index Scan on idx_post_aggregates_featured_community_score  (cost=0.00..4.40 rows=12 width=0) (actual time=0.003..0.003 rows=0 loops=1)
					Index Cond: ((community_id = 14) AND (featured_community < false))
			->  Bitmap Index Scan on idx_post_aggregates_featured_community_most_comments  (cost=0.00..4.36 rows=6 width=0) (actual time=0.002..0.002 rows=0 loops=1)
					Index Cond: ((community_id = 14) AND (featured_community = false) AND (comments < '0'::bigint))
			->  Bitmap Index Scan on idx_post_aggregates_featured_community_most_comments  (cost=0.00..4.29 rows=1 width=0) (actual time=0.242..0.242 rows=6000 loops=1)
					Index Cond: ((community_id = 14) AND (featured_community = false) AND (comments = '0'::bigint))
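
The "ugly stuff" needed for mixed sort directions might look something like this hypothetical workaround (not what this PR does): negate a numeric column so the whole tuple runs in one direction, paired with a matching expression index:

-- Hypothetical sketch for ORDER BY score DESC, published ASC: row comparison
-- requires a single direction, so negate the numeric column and sort
-- everything ascending. Index name and cursor values are illustrative.
CREATE INDEX idx_example_neg_score ON post_aggregates ((-score), published);

SELECT *
FROM post_aggregates
WHERE (-score, published) > (-100, '2023-12-18'::timestamptz)
ORDER BY (-score), published  -- same order as score DESC, published ASC
LIMIT 10;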

@dullbananas dullbananas reopened this Dec 19, 2023
@dullbananas dullbananas marked this pull request as ready for review December 19, 2023 00:38
phiresky (Collaborator) commented Dec 19, 2023

Just FYI, a bitmap index scan is not a "real" index scan (depending on definitions) and can be much worse than a normal index scan.

"The bitmap is one bit per heap page. The bitmap index scan sets the bits based on the heap page address that the index entry points to.

So when it goes to do the bitmap heap scan, it just does a linear table scan, reading the bitmap to see whether it should bother with a particular page or seek over it." https://dba.stackexchange.com/questions/119386/understanding-bitmap-heap-scan-and-bitmap-index-scan

To test, you should probably also put at least 1 million rows in your table, because otherwise the results may be completely different from the real world (PG knows scanning through 1,000 rows doesn't matter, so it just does it).
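
One way to seed that much data, as a rough sketch (the column list is simplified and illustrative, not the project's actual seeding), is generate_series:

-- Seed ~1M rows with distinct timestamps so EXPLAIN ANALYZE reflects
-- realistic planner behavior; columns shown are illustrative only.
INSERT INTO post_aggregates (post_id, community_id, comments, featured_community, published)
SELECT g, 1 + g % 50, g % 100, false, now() - g * interval '1 second'
FROM generate_series(1, 1000000) AS g;

ANALYZE post_aggregates;  -- refresh statistics so row estimates are sane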

dullbananas (Collaborator, Author) commented Dec 20, 2023

The index scans are oddly inconsistent. This is with 999999 posts, each with different timestamps:

->  Index Scan using idx_post_aggregates_featured_community_published on post_aggregates  (cost=0.42..77988.07 rows=311457 width=106) (actual time=5.691..5.701 rows=21 loops=1)
		Index Cond: (community_id = 2)
		Filter: ((featured_community < false) OR ((published < '2023-12-20 19:31:41.673269+00'::timestamp with time zone) AND (NOT featured_community)) OR ((post_id < 673364) AND (NOT featured_community) AND (published = '2023-12-20 19:31:41.673269+00'::timestamp with time zone)))
		Rows Removed by Filter: 19980

The command used (#4285):

scripts/db_perf.sh --read-post-pages 1000 --posts 1000000

phiresky (Collaborator)

An index scan with only community_id in the condition is still really bad, because it means reading all of the community's rows into memory. The Index Cond should be doing 99% of the filtering; the Filter should ideally be empty. You can tell by "Rows Removed by Filter" being the number of posts in community 2 (the one you're looking at) minus 10.

That was exactly why I added that prefetch/upper-bound function: PG was not able to understand the hot posts query well enough, and the index filters were making it fetch all of a community's posts into RAM. (Should be in the discussion in #3872.)

Also remember that the expensive queries are the ones for the subscribed feed, where it has to filter by multiple community IDs simultaneously.
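
To illustrate, here is a minimal sketch where the sort keys follow community_id in the index, so each branch of the cursor condition can land under Index Cond instead of Filter (the index name, literals, and page size are illustrative, not the PR's actual migration):

-- Illustrative index and query shape; not this PR's actual schema change.
CREATE INDEX idx_example_featured_published
    ON post_aggregates (community_id, featured_community DESC, published DESC);

SELECT *
FROM post_aggregates
WHERE community_id = 2
  AND (featured_community < true
       OR (featured_community = true
           AND published < '2023-12-20'::timestamptz))
ORDER BY featured_community DESC, published DESC
LIMIT 10;

-- The subscribed feed is the same shape with many communities, e.g.:
--   WHERE community_id = ANY (ARRAY[2, 14, 27]) AND (...)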

dullbananas (Collaborator, Author)

Replaced by #4320
