QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads #13587

Merged: kjnilsson merged 1 commit into main from qq-checkpointing-tweaks-2 on Mar 26, 2025

Conversation

kjnilsson
Contributor

@kjnilsson kjnilsson commented Mar 21, 2025

Lower the min_checkpoint_interval substantially to allow quorum queues better control over when checkpoints are taken.

Track the message bytes written to the log and use this count to request a checkpoint every 64MB if no other checkpoint condition has been met. This ensures that queues with very large messages (1MB+) are checkpointed based on their data ingress rather than on the number of indexes.
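
A minimal sketch of the byte-based condition described above, not the actual quorum queue code. All names here (`CheckpointTracker`, `record_enqueue`, `should_checkpoint`, `checkpoint_taken`, `count_checkpoints`) are hypothetical, 64MB is assumed to mean 64 × 1024 × 1024 bytes, and the backlog-based scaling of the threshold is left out.

```python
from dataclasses import dataclass

# Assumption: the 64MB figure from the PR description, read as 64 * 1024 * 1024 bytes.
BYTES_CHECKPOINT_THRESHOLD = 64 * 1024 * 1024


@dataclass
class CheckpointTracker:
    """Illustrative model only; hypothetical names, not the rabbit_fifo implementation."""

    threshold: int = BYTES_CHECKPOINT_THRESHOLD
    bytes_since_checkpoint: int = 0

    def record_enqueue(self, message_size: int) -> None:
        # Count message bytes written to the log since the last checkpoint.
        self.bytes_since_checkpoint += message_size

    def should_checkpoint(self, other_condition_met: bool = False) -> bool:
        # Request a checkpoint if some other condition (for example the index
        # count) already fired, or if enough bytes have been written.
        return other_condition_met or self.bytes_since_checkpoint >= self.threshold

    def checkpoint_taken(self) -> None:
        # Reset the byte counter once a checkpoint has been taken.
        self.bytes_since_checkpoint = 0


def count_checkpoints(message_size: int, message_count: int) -> int:
    """Simulate a run of enqueues and count byte-triggered checkpoints."""
    tracker = CheckpointTracker()
    taken = 0
    for _ in range(message_count):
        tracker.record_enqueue(message_size)
        if tracker.should_checkpoint():
            taken += 1
            tracker.checkpoint_taken()
    return taken
```

In this model, 200 messages of 1 MiB each trigger a checkpoint after every 64th message, while the same number of 100-byte messages never reaches the byte threshold and would rely on the other conditions.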

@kjnilsson kjnilsson force-pushed the qq-checkpointing-tweaks-2 branch 4 times, most recently from bb1988e to e0fe933 Compare March 24, 2025 14:38
@kjnilsson kjnilsson added this to the 4.1.0 milestone Mar 24, 2025
@kjnilsson kjnilsson force-pushed the qq-checkpointing-tweaks-2 branch from e0fe933 to d2a294e Compare March 24, 2025 16:38
@kjnilsson kjnilsson marked this pull request as ready for review March 25, 2025 09:23
@kjnilsson kjnilsson changed the title QQ: tweaks to checkpointing for use cases with fewer larger messages. QQ: revise checkpointing logic Mar 25, 2025
@kjnilsson
Contributor Author

kjnilsson commented Mar 25, 2025

Testing:

Try various workloads with both small and large messages and monitor disk usage to ensure that no undue buildup of segments is retained when the workload is paused.

Also ensure segments are cleared up reasonably after the queue is emptied or purged.

For large message workloads this PR works best with a lower raft.segment_max_entries configuration, for example 64. This is needed until we can include rabbitmq/ra#526, which will put an upper byte size limit on segments (in addition to the entry-based limit, which is 4096 by default).
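
A rough back-of-the-envelope calculation of why the entry limit matters for large messages. It assumes a segment only rolls over at the entry limit, that every entry carries one 1 MiB message, and it ignores per-entry overhead; `full_segment_bytes` is a hypothetical helper, not part of RabbitMQ or Ra.

```python
MIB = 1024 * 1024


def full_segment_bytes(max_entries: int, message_size: int) -> int:
    # Payload bytes held by one full segment, ignoring per-entry overhead.
    return max_entries * message_size


# Default entry limit (4096) versus the suggested value (64), with 1 MiB messages.
default_segment = full_segment_bytes(4096, MIB)
tuned_segment = full_segment_bytes(64, MIB)

print(default_segment // MIB, "MiB per full segment at 4096 entries")
print(tuned_segment // MIB, "MiB per full segment at 64 entries")
```

Under these assumptions a full segment holds about 4 GiB at the default limit and about 64 MiB at a limit of 64, which is presumably why the lower setting works better until segments have a byte size limit.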

@kjnilsson kjnilsson requested a review from acogoluegnes March 25, 2025 09:42
@kjnilsson kjnilsson changed the title QQ: revise checkpointing logic QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads Mar 25, 2025
@kjnilsson kjnilsson marked this pull request as draft March 25, 2025 14:24
To take more frequent checkpoints for large message workload

Lower the min_checkpoint_interval substantially to allow quorum queues
better control over when checkpoints are taken.

Track bytes enqueued in the aux state and suggest a checkpoint after
every 64MB enqueued (this value is scaled according to backlog just
like the indexes condition).
This should help with more timely checkpointing when very large
messages are used.

Try evaluating byte size independently of time window

also increase max size
@kjnilsson kjnilsson force-pushed the qq-checkpointing-tweaks-2 branch from 8e4a3aa to 6695282 Compare March 26, 2025 08:25
@kjnilsson kjnilsson marked this pull request as ready for review March 26, 2025 08:27
@mkuratczyk
Contributor

We've run lots of tests with different variations of the tweaks. This clearly improves the situation with many slow queues, especially as messages get larger. In this test, all queues are empty (publishers are slow and consumers are present so they immediately get the messages). We can see that main uses more and more disk space, while this branch does not.
[Screenshot, 2025-03-26 11:29:29]

@mkuratczyk mkuratczyk self-requested a review March 26, 2025 10:43
@kjnilsson kjnilsson merged commit 26fa541 into main Mar 26, 2025
273 checks passed
@kjnilsson kjnilsson deleted the qq-checkpointing-tweaks-2 branch March 26, 2025 10:43
michaelklishin added a commit that referenced this pull request Mar 26, 2025
QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads (backport #13587)
michaelklishin added a commit that referenced this pull request Mar 26, 2025
QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads (backport #13587) (backport #13621)