Conversation

@bkirwi (Contributor) commented May 27, 2025

  • Make the code more robust to aggressive compaction limits. (Instead of panicking, we fall back to the existing code path for when we can't fit two runs in memory.)
  • Exercise a wider range of configs in CI.
  • Fix a txn-wal bug, which could be triggered when data had undergone compaction but wasn't globally consolidated. (Which can happen when memory limits are too small to compact all parts in one go.)

Motivation

Turns out we're missing some test coverage in this area!

Tips for reviewer

I'd love a review from a txn-wal expert, since I'm not that familiar with the code there.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@bkirwi force-pushed the ci-tune branch 3 times, most recently from 4707684 to c7f6434, on May 28, 2025 19:29
@bkirwi marked this pull request as ready for review May 28, 2025 21:54
@bkirwi requested review from a team and aljoscha as code owners May 28, 2025 21:54
@bkirwi (Contributor, Author) commented May 28, 2025

For the record, the full CI history: https://buildkite.com/materialize/nightly/builds?branch=bkirwi%3Aci-tune

The first couple of runs included the CI tuning but not the bugfix, rebased on both a recent release and an older one, to check that the failures weren't caused by a recently introduced bug.

The next two runs included the bugfix, and the failures disappeared.

"16",
"1000",
]
self.flags_with_values["persist_compaction_memory_bound_bytes"] = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we limit the size of blobs to 1 MiB, 16 MiB, and 128 MiB. Should those limits be reflected here too? Assuming that we're tuning these to fit a very limited number of Parts/blobs in-memory at a given time during compaction.

Contributor Author (@bkirwi):

What do you mean by "reflected"? Do you think we should add some smaller sizes here as well?

Contributor:
Yeah sorry I wasn't totally clear here. It seems like the values here for persist_compaction_memory_bound_bytes try to align closely with 1, 4, and 8 blobs, given the default value for persist_blob_target_size. Just wondering if it would be helpful to set smaller values here so when the target blob size is say 16 MiB we still fit only 1, 4, and 8 blobs in the compaction memory bound.

Thinking this through a bit more, adding 64 MiB (67108864) here might be interesting: even when the target blob size is small, it's still an aggressive memory bound on compaction, and it can exercise the case where a single blob is larger than our entire bound. wdyt?

Contributor Author (@bkirwi):
Done, thanks!
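
To make the sizing intuition above concrete, here is a tiny sketch. It is not code from this PR; it just works through the arithmetic using the sizes mentioned in this thread (1/16/128 MiB blob targets and the suggested 64 MiB bound):

    # Rough illustration (not code from this PR): with a 64 MiB compaction
    # memory bound, how many target-sized blobs fit at each blob size
    # mentioned in this review thread?
    MIB = 1024 * 1024

    compaction_memory_bound = 64 * MIB  # 67108864, the value suggested above
    blob_target_sizes = [1 * MIB, 16 * MIB, 128 * MIB]  # blob limits mentioned above

    for blob in blob_target_sizes:
        fits = compaction_memory_bound // blob
        if fits == 0:
            note = "a single blob is larger than the entire bound"
        else:
            note = f"roughly {fits} blob(s) fit under the bound"
        print(f"blob target {blob // MIB:>3} MiB vs 64 MiB bound: {note}")

At the 128 MiB blob target this exercises exactly the case the reviewer describes, where one blob alone exceeds the whole compaction memory bound.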

bkirwi added 3 commits May 30, 2025 15:52. Two of the commit messages:

  • "We'd like to do randomized testing, but that could make this assertion fire very easily. We already have code to report when the limit's not high enough but still make progress, so let's lean on that instead."

  • "It seems that, when compaction is tuned to more frequently generate multiple runs, it's possible to see the retraction of the data before its insertion in this loop. Consolidating means that we'll get a reasonable snapshot of the data even when timestamps have been advanced."
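
For readers less familiar with the update model behind that second commit message, here is a minimal toy sketch (not the txn-wal code; the names and values are made up) of why consolidating updates yields a reasonable snapshot even when a retraction is read before its matching insertion:

    # Toy illustration (not the txn-wal code): consolidating (data, diff)
    # updates gives a sensible snapshot even when a retraction is read
    # before its matching insertion.
    from collections import defaultdict

    updates = [
        ("row_a", -1),  # retraction encountered first
        ("row_b", +1),
        ("row_a", +1),  # the matching insertion shows up later
    ]

    def consolidate(updates):
        totals = defaultdict(int)
        for data, diff in updates:
            totals[data] += diff
        # Keep only rows whose net diff is non-zero.
        return {data: diff for data, diff in totals.items() if diff != 0}

    print(consolidate(updates))  # {'row_b': 1}: row_a's insert and retract cancel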
@def- (Contributor) left a comment:

No complaints from QA!

@bkirwi merged commit e8e126f into MaterializeInc:main on Jun 2, 2025 (89 checks passed)
@bkirwi deleted the ci-tune branch June 4, 2025 18:05