Conversation

@bkirwi (Contributor) commented May 27, 2025

  • Make the code more robust to aggressive compaction limits. (Instead of panicking, we fall back to the existing code path for when we can't fit two runs in memory.)
  • Exercise a wider range of configs in CI.
  • Fix a txn-wal bug, which could be triggered when data had undergone compaction but wasn't globally consolidated. (Which can happen when memory limits are too small to compact all parts in one go.)

Motivation

Turns out we're missing some test coverage in this area!

Tips for reviewer

I'd love a review from a txn-wal expert, since I'm not that familiar with the code there.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@bkirwi force-pushed the ci-tune branch 3 times, most recently from 4707684 to c7f6434, on May 28, 2025 19:29
@bkirwi marked this pull request as ready for review May 28, 2025 21:54
@bkirwi requested review from a team and aljoscha as code owners May 28, 2025 21:54
@bkirwi (Contributor, Author) commented May 28, 2025

For the record, the full CI history: https://buildkite.com/materialize/nightly/builds?branch=bkirwi%3Aci-tune

The first couple of runs included the CI tuning but not the bugfix, rebased on both a recent release and an older one, to check that the failures weren't caused by a recently introduced bug.

The next two runs included the bugfix, and the failures disappeared.

"16",
"1000",
]
self.flags_with_values["persist_compaction_memory_bound_bytes"] = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we limit the size of blobs to 1 MiB, 16 MiB, and 128 MiB. Should those limits be reflected here too? Assuming that we're tuning these to fit a very limited number of Parts/blobs in-memory at a given time during compaction.

Contributor Author (@bkirwi):

What do you mean by "reflected"? Do you think we should add some smaller sizes here as well?

Contributor:
Yeah sorry I wasn't totally clear here. It seems like the values here for persist_compaction_memory_bound_bytes try to align closely with 1, 4, and 8 blobs, given the default value for persist_blob_target_size. Just wondering if it would be helpful to set smaller values here so when the target blob size is say 16 MiB we still fit only 1, 4, and 8 blobs in the compaction memory bound.

Thinking this through a bit more, adding 64 MiB (67108864) here might be interesting: even when the target blob size is small, it's still an aggressive memory bound on compaction, and it can exercise the case where a single blob is larger than our entire bound. wdyt?

Contributor Author (@bkirwi):
Done, thanks!
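
To make the sizing intuition above concrete, here is a tiny sketch. It is not code from this PR; it just works through the arithmetic using the sizes mentioned in this thread (1/16/128 MiB blob targets and the suggested 64 MiB bound):

    # Rough illustration (not code from this PR): with a 64 MiB compaction
    # memory bound, how many target-sized blobs fit at each blob size
    # mentioned in this review thread?
    MIB = 1024 * 1024

    compaction_memory_bound = 64 * MIB  # 67108864, the value suggested above
    blob_target_sizes = [1 * MIB, 16 * MIB, 128 * MIB]  # blob limits mentioned above

    for blob in blob_target_sizes:
        fits = compaction_memory_bound // blob
        if fits == 0:
            note = "a single blob is larger than the entire bound"
        else:
            note = f"roughly {fits} blob(s) fit under the bound"
        print(f"blob target {blob // MIB:>3} MiB vs 64 MiB bound: {note}")

At the 128 MiB blob target this exercises exactly the case the reviewer describes, where one blob alone exceeds the whole compaction memory bound.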

bkirwi added 3 commits May 30, 2025 15:52. Two of the commit messages:

  • "We'd like to do randomized testing, but that could make this assertion fire very easily. We already have code to report when the limit's not high enough but still make progress, so let's lean on that instead."

  • "It seems that, when compaction is tuned to more frequently generate multiple runs, it's possible to see the retraction of the data before its insertion in this loop. Consolidating means that we'll get a reasonable snapshot of the data even when timestamps have been advanced."
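
For readers less familiar with the update model behind that second commit message, here is a minimal toy sketch (not the txn-wal code; the names and values are made up) of why consolidating updates yields a reasonable snapshot even when a retraction is read before its matching insertion:

    # Toy illustration (not the txn-wal code): consolidating (data, diff)
    # updates gives a sensible snapshot even when a retraction is read
    # before its matching insertion.
    from collections import defaultdict

    updates = [
        ("row_a", -1),  # retraction encountered first
        ("row_b", +1),
        ("row_a", +1),  # the matching insertion shows up later
    ]

    def consolidate(updates):
        totals = defaultdict(int)
        for data, diff in updates:
            totals[data] += diff
        # Keep only rows whose net diff is non-zero.
        return {data: diff for data, diff in totals.items() if diff != 0}

    print(consolidate(updates))  # {'row_b': 1}: row_a's insert and retract cancel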
@def- (Contributor) left a comment:

No complaints from QA!

@bkirwi merged commit e8e126f into MaterializeInc:main on Jun 2, 2025 (89 checks passed)
@bkirwi deleted the ci-tune branch June 4, 2025 18:05