Conversation

@antiguru
Member

@antiguru antiguru commented May 28, 2025

This PR adds a limiter task that periodically confirms that lgalloc's current disk usage is below some configured limit and terminates the process otherwise.

The disk limit is calculated by applying a configurable factor to the process' memory limit. There is also a burst option, allowing the process to use more than the configured disk limit for a time.

The default limits are configured to match the current production reality: The disk limit is twice the memory limit and bursting is disabled.
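
The limit calculation described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `LimiterConfig` struct, its field names, and the `exceeds_limit` helper are hypothetical, and the real limiter reads its inputs from configuration and from lgalloc's own usage statistics.

```rust
/// Hypothetical limiter configuration; names are illustrative, not from the PR.
#[derive(Debug, Clone, Copy)]
struct LimiterConfig {
    /// The process's memory limit, in bytes.
    memory_limit_bytes: u64,
    /// Factor applied to the memory limit to derive the disk limit.
    disk_limit_factor: f64,
}

impl LimiterConfig {
    /// Disk limit = memory limit * factor, truncated to whole bytes.
    fn disk_limit_bytes(&self) -> u64 {
        (self.memory_limit_bytes as f64 * self.disk_limit_factor) as u64
    }
}

/// Returns true if the observed disk usage is above the derived limit.
fn exceeds_limit(config: &LimiterConfig, disk_usage_bytes: u64) -> bool {
    disk_usage_bytes > config.disk_limit_bytes()
}

fn main() {
    // Default described in this PR: disk limit is twice the memory limit.
    let config = LimiterConfig {
        memory_limit_bytes: 4 << 30, // 4 GiB, an arbitrary example value
        disk_limit_factor: 2.0,
    };
    println!("disk limit: {} bytes", config.disk_limit_bytes());
    println!("over limit at 9 GiB: {}", exceeds_limit(&config, 9 << 30));
}
```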

Motivation

  • This PR adds a known-desirable feature.

Closes MaterializeInc/database-issues/issues/9306

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

Contributor

@bkirwi bkirwi left a comment

(Sorry, started commenting on this before I realized it was still in draft! Submitting since it's already typed out, but feel free to ignore.)

@teskje teskje force-pushed the lgalloc_limiter branch 3 times, most recently from 82f4a0b to 1bde917 on June 11, 2025 15:44
@teskje teskje changed the title from "WIP lgalloc limiter" to "lgalloc disk usage limiter" Jun 11, 2025
Comment on lines +193 to +207
warn!(
disk_usage,
disk_limit, "lgalloc disk utilization exceeded configured limits",
);
Contributor

rustfmt showing some bad taste here...

@teskje teskje self-assigned this Jun 11, 2025
@teskje teskje force-pushed the lgalloc_limiter branch from c8eec25 to 04116ff on June 12, 2025 10:57
@teskje
Contributor

teskje commented Jun 12, 2025

Tested this in my staging env and confirmed:

  • Reaching the configured disk limit terminates the replica.
  • Changing the configured disk limit of a healthy replica to a value lower than its current disk usage terminates the replica.
  • A replica can survive a hydration on the burst budget that it would otherwise not survive.
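
The three observed behaviours are consistent with a check along the following lines. This is a sketch under stated assumptions, not the PR's implementation: it assumes the burst budget is tracked in byte-seconds and drained in proportion to how far usage exceeds the limit, and the `BurstLimiter` and `Verdict` names are hypothetical.

```rust
use std::time::Duration;

/// Outcome of a single limiter check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    WithinLimit,
    Bursting,
    Terminate,
}

/// Hypothetical limiter state. The byte-seconds budget is an assumption.
struct BurstLimiter {
    disk_limit_bytes: u64,
    /// Remaining burst budget in byte-seconds; 0.0 means bursting is disabled.
    burst_budget: f64,
}

impl BurstLimiter {
    /// Checks current usage, draining the burst budget while over the limit.
    fn check(&mut self, disk_usage_bytes: u64, elapsed: Duration) -> Verdict {
        if disk_usage_bytes <= self.disk_limit_bytes {
            return Verdict::WithinLimit;
        }
        let excess = (disk_usage_bytes - self.disk_limit_bytes) as f64;
        self.burst_budget -= excess * elapsed.as_secs_f64();
        if self.burst_budget <= 0.0 {
            Verdict::Terminate
        } else {
            Verdict::Bursting
        }
    }
}

fn main() {
    // Bursting disabled: exceeding the limit terminates right away.
    let mut strict = BurstLimiter { disk_limit_bytes: 100, burst_budget: 0.0 };
    println!("{:?}", strict.check(101, Duration::from_secs(1)));

    // With a budget, a temporary spike (such as a hydration) can be survived.
    let mut bursty = BurstLimiter { disk_limit_bytes: 100, burst_budget: 50.0 };
    println!("{:?}", bursty.check(110, Duration::from_secs(2)));
}
```

Lowering `disk_limit_bytes` below the current usage on a running limiter produces the same `Terminate` verdict on the next check, which matches the second bullet.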

@teskje teskje force-pushed the lgalloc_limiter branch 4 times, most recently from 8ad653d to 724049f on June 13, 2025 17:41
@teskje teskje marked this pull request as ready for review June 14, 2025 13:57
@teskje teskje requested review from a team as code owners June 14, 2025 13:57
@def- def- force-pushed the lgalloc_limiter branch from 724049f to c9e2b9c on June 16, 2025 10:02
@def-
Contributor

def- commented Jun 16, 2025

Rebased and retriggered nightly: https://buildkite.com/materialize/nightly/builds/12336 (since Hetzner is still struggling with aarch64 availability)

def- and others added 6 commits June 16, 2025 10:40
Also use information from less important jobs

This commit adds a limiter task that periodically confirms that
lgalloc's current disk usage is below some configured limit and
terminates the process otherwise.

The disk limit is calculated by applying a configurable factor to the
process' memory limit. There is also a burst option, allowing the
process to use more than the configured disk limit for a time.

The default limits are configured to match the current production
reality: The disk limit is twice the memory limit and bursting is
disabled.

Signed-off-by: Moritz Hoffmann <mh@materialize.com>

These flags can potentially make tests more unstable, by reducing their
available disk size, so randomly changing them seems like a recipe for
flaky tests.

This commit adds metrics reporting the lgalloc limiter's current disk
limit, disk usage, and burst budget. Having these will be useful in
production, where debug logging is usually switched off.
@def- def- force-pushed the lgalloc_limiter branch from c9e2b9c to 6eced64 on June 16, 2025 10:52
Member Author

@antiguru antiguru left a comment

Looks good, thank you!

@teskje
Contributor

teskje commented Jun 16, 2025

Hm... Some feature benchmark regressions. I think we need to move the lgalloc limit checker into a separate thread so the I/O it's doing doesn't interfere with other tokio tasks.
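
For reference, a minimal sketch of what running the check on a dedicated OS thread could look like, using only the standard library. This is not code from the PR: `spawn_checker`, the thread name, and the stop channel are hypothetical, and the closure stands in for the actual disk usage check.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Spawns a thread that runs `check` every `interval` until told to stop.
/// Returns the stop sender and a handle yielding the number of checks run.
fn spawn_checker<F>(interval: Duration, mut check: F) -> (mpsc::Sender<()>, thread::JoinHandle<u64>)
where
    F: FnMut() + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::Builder::new()
        .name("disk-limit-checker".into())
        .spawn(move || {
            let mut runs = 0u64;
            loop {
                // Any blocking I/O happens here, off the async runtime's workers.
                check();
                runs += 1;
                // Wait for the next tick, or exit if asked to stop.
                match stop_rx.recv_timeout(interval) {
                    Err(RecvTimeoutError::Timeout) => continue,
                    Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
                }
            }
            runs
        })
        .expect("failed to spawn checker thread");
    (stop_tx, handle)
}

fn main() {
    let (stop, handle) = spawn_checker(Duration::from_millis(10), || {
        // Placeholder for the real disk usage check.
    });
    thread::sleep(Duration::from_millis(50));
    stop.send(()).expect("checker thread exited early");
    let runs = handle.join().expect("checker thread panicked");
    println!("checker ran {runs} times");
}
```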

@antiguru
Member Author

> I think we need to move the lgalloc limit checker into a separate thread so the I/O it's doing doesn't interfere with other tokio tasks.

Did you have a chance to look at the metric to see how long its invocations took? We should have enough tokio tasks to not interfere, but who knows!

@teskje
Contributor

teskje commented Jun 18, 2025

> Did you have a chance to look at the metric to see how long its invocations took?

Just checked in my staging env and it seems like the max invocation time is always 512us... which is also the smallest bucket in the histogram :D The benchmark regression has also gone away after a retry, so seems fine to merge.
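
To illustrate why the reported maximum sits at exactly 512us: a bucketed histogram only records which bucket an observation fell into, so anything at or below the smallest bucket boundary reads as that boundary. The sketch below assumes power-of-two bucket boundaries, which is a guess and not something confirmed in this thread; `bucket_upper_bound_us` is a hypothetical helper.

```rust
/// Returns the upper bound (in microseconds) of the bucket an observation
/// lands in, assuming power-of-two buckets starting at `smallest_bucket_us`,
/// which must itself be a power of two.
fn bucket_upper_bound_us(observed_us: u64, smallest_bucket_us: u64) -> u64 {
    observed_us.max(smallest_bucket_us).next_power_of_two()
}

fn main() {
    // A 30us invocation and a 500us invocation are indistinguishable:
    // both are reported as "at most 512us".
    for observed in [30, 500, 513, 2000] {
        println!(
            "{observed}us -> bucket <= {}us",
            bucket_upper_bound_us(observed, 512)
        );
    }
}
```

Under this assumption, a maximum of 512us means only that every invocation took at most 512us; the true duration may be much shorter.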

@teskje teskje merged commit 3fce8a2 into MaterializeInc:main Jun 18, 2025
277 of 279 checks passed
@antiguru antiguru deleted the lgalloc_limiter branch June 19, 2025 12:51

4 participants