QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads #13587

kjnilsson · 2025-03-21T14:32:44Z

Lower the min_checkpoint_interval substantially to allow quorum queues better control over when checkpoints are taken.

Track the message bytes written to the log and use this to request a checkpoint every 64MB if no other checkpoint condition was met. This ensures that queues with very large messages (1MB+) are checkpointed based on their data ingress rather than on the number of indexes.
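A minimal sketch of the byte-based condition described above, for illustration only. It is not the rabbit_fifo implementation: the class, its method names and the index threshold are hypothetical, and it leaves out min_checkpoint_interval and any scaling. Only the 64MB figure comes from this PR.

```python
from dataclasses import dataclass

MB = 1024 * 1024
CHECKPOINT_BYTES = 64 * MB  # value stated in this PR


@dataclass
class CheckpointTracker:
    """Hypothetical tracker, not the actual quorum queue code."""

    min_indexes: int = 4096  # assumed index-based threshold, for illustration
    bytes_since: int = 0     # message bytes written since the last checkpoint
    indexes_since: int = 0   # log entries written since the last checkpoint

    def record_enqueue(self, size_bytes: int) -> None:
        # Every enqueue adds one log entry and size_bytes of message data.
        self.bytes_since += size_bytes
        self.indexes_since += 1

    def should_checkpoint(self) -> bool:
        # Either condition is enough: many entries, or a lot of data.
        by_indexes = self.indexes_since >= self.min_indexes
        by_bytes = self.bytes_since >= CHECKPOINT_BYTES
        return by_indexes or by_bytes

    def checkpoint_taken(self) -> None:
        # Reset both counters once a checkpoint has been taken.
        self.bytes_since = 0
        self.indexes_since = 0
```

With 1MB messages the byte condition fires after 64 enqueues, long before an index-only condition of a few thousand entries would.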

kjnilsson · 2025-03-25T09:36:03Z

Testing:

Try various workloads with both small and large messages and monitor the disk to ensure that no undue buildup of segments is retained when the workload is paused.

Also ensure segments are cleared up reasonably soon after the queue is emptied / purged.
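One way to watch for segment buildup during such a test is to sum the segment files under the node's data directory at intervals. The helper below is a rough sketch: the function name is hypothetical, and the `.segment` file suffix and the directory passed in are assumptions that should be checked against the actual node.

```python
from pathlib import Path


def segment_usage(data_dir: str, suffix: str = ".segment") -> tuple[int, int]:
    """Return (file_count, total_bytes) for files ending in `suffix`.

    Hypothetical helper; the suffix is an assumption about how
    segment files are named on disk.
    """
    files = [p for p in Path(data_dir).rglob(f"*{suffix}") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

Calling it before pausing the workload, after pausing, and after a purge gives three numbers to compare; the last two should drop rather than stay flat.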

For large message workloads this PR works best with a lower raft.segment_max_entries configuration, for example 64. This is until we can include rabbitmq/ra#526, which will put an upper byte size limit on segments (in addition to the entry-based limit, which is 4096 by default).
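A back-of-the-envelope calculation shows why the entry-based limit matters for large messages. It assumes each segment entry holds roughly one message and ignores per-entry overhead; the function name is hypothetical.

```python
MB = 1024 * 1024


def approx_segment_bytes(max_entries: int, avg_message_bytes: int) -> int:
    # Rough upper bound: one message per entry, no overhead counted.
    return max_entries * avg_message_bytes


default = approx_segment_bytes(4096, 1 * MB)  # default entry limit, 1MB messages
lowered = approx_segment_bytes(64, 1 * MB)    # suggested lower limit

print(default // MB, "MB")  # 4096 MB
print(lowered // MB, "MB")  # 64 MB
```

With the default limit a single segment of 1MB messages can approach 4GB, so it can only be deleted once all of that data is no longer needed. At 64 entries a segment is about 64MB, the same order as the byte-based checkpoint threshold.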

To take more frequent checkpoints for large message workloads:

Lower the min_checkpoint_interval substantially to allow quorum queues better control over when checkpoints are taken. Track bytes enqueued in the aux state and suggest a checkpoint after every 64MB enqueued (this value is scaled according to backlog, just like the indexes condition). This should help with more timely checkpointing when very large messages are used.

Try evaluating byte size independently of the time window.

Also increase max size.

mkuratczyk · 2025-03-26T10:39:09Z

We've run lots of tests with different variations of the tweaks. This clearly improves the situation with many slow queues, especially as messages get larger. In this test, all queues are empty (publishers are slow and consumers are present, so messages are delivered immediately). We can see that main uses more and more disk space, while this branch does not.

QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads (backport #13587)

QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads (backport #13587) (backport #13621)

kjnilsson force-pushed the qq-checkpointing-tweaks-2 branch 4 times, most recently from bb1988e to e0fe933 Compare March 24, 2025 14:38

kjnilsson added backport-v4.0.x backport-v4.1.x labels Mar 24, 2025

kjnilsson added this to the 4.1.0 milestone Mar 24, 2025

kjnilsson force-pushed the qq-checkpointing-tweaks-2 branch from e0fe933 to d2a294e Compare March 24, 2025 16:38

kjnilsson marked this pull request as ready for review March 25, 2025 09:23

kjnilsson changed the title ~~QQ: tweaks to checkpointing for use cases with fewer larger messages.~~ QQ: revise checkpointing logic Mar 25, 2025

kjnilsson requested a review from acogoluegnes March 25, 2025 09:42

kjnilsson changed the title ~~QQ: revise checkpointing logic~~ QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads Mar 25, 2025

acogoluegnes approved these changes Mar 25, 2025


kjnilsson marked this pull request as draft March 25, 2025 14:24

kjnilsson force-pushed the qq-checkpointing-tweaks-2 branch from 8e4a3aa to 6695282 Compare March 26, 2025 08:25

kjnilsson marked this pull request as ready for review March 26, 2025 08:27

mkuratczyk self-requested a review March 26, 2025 10:43

mkuratczyk approved these changes Mar 26, 2025


kjnilsson merged commit 26fa541 into main Mar 26, 2025
273 checks passed

kjnilsson deleted the qq-checkpointing-tweaks-2 branch March 26, 2025 10:43

mergify bot mentioned this pull request Mar 26, 2025

QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads (backport #13587) #13621

Merged

michaelklishin added a commit that referenced this pull request Mar 26, 2025

Merge pull request #13621 from rabbitmq/mergify/bp/v4.1.x/pr-13587

338973d

QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads (backport #13587)

mergify bot mentioned this pull request Mar 26, 2025

QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads (backport #13587) (backport #13621) #13622

Merged

michaelklishin added a commit that referenced this pull request Mar 26, 2025

Merge pull request #13622 from rabbitmq/mergify/bp/v4.0.x/pr-13621

3581ed5

QQ: Revise checkpointing logic to take more frequent checkpoints for large message workloads (backport #13587) (backport #13621)
