
persist: Make sure to obtain a lease before selecting a batch#35554

Merged
bkirwi merged 2 commits into MaterializeInc:main from
bkirwi:lease-fix
Mar 20, 2026

Conversation


@bkirwi bkirwi commented Mar 19, 2026

A "seqno lease" is the tool Persist uses internally to prevent garbage collection of a batch that a reader is still processing. It's important that we obtain the lease before we choose the batch to return, to avoid a race where the state changes between the batch being selected and the lease being taken. Unfortunately, callers did this in the wrong order - chose a batch and then obtained a lease for it.

This may have been exacerbated by the recent-ish #34590, which allows more aggressive seqno downgrades to avoid leaks.
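The race and the fix can be illustrated with a toy, self-contained model (hypothetical types, a sketch only, not the real Persist code): GC may drop any batch below the earliest outstanding seqno lease, so leasing after selecting leaves a window in which a concurrent GC can collect the chosen batch.

```rust
// Toy model of the race; `SeqNo`, `State`, and `gc` are stand-ins for
// the real Persist types described in this PR.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct SeqNo(u64);

struct State {
    seqno: SeqNo,                        // current state version
    batches: Vec<(SeqNo, &'static str)>, // batches tagged with the seqno that references them
    leases: Vec<SeqNo>,                  // outstanding seqno leases
}

impl State {
    // GC may drop anything below the earliest outstanding lease (or the
    // current seqno, if there are no leases at all).
    fn gc(&mut self) {
        let floor = self.leases.iter().min().copied().unwrap_or(self.seqno);
        self.batches.retain(|(s, _)| *s >= floor);
    }

    fn lease_seqno(&mut self) -> SeqNo {
        self.leases.push(self.seqno);
        self.seqno
    }

    fn advance(&mut self) {
        self.seqno = SeqNo(self.seqno.0 + 1);
    }
}

fn main() {
    // Buggy order: select a batch, then the state advances and GC runs,
    // and only then is the lease taken.
    let mut state = State {
        seqno: SeqNo(3),
        batches: vec![(SeqNo(3), "b3")],
        leases: vec![],
    };
    let selected = state.batches[0]; // batch chosen *before* leasing
    state.advance(); // a concurrent writer advances the state...
    state.gc(); // ...and GC collects everything below the new seqno
    let _lease = state.lease_seqno(); // lease taken too late
    assert!(!state.batches.contains(&selected)); // the selected batch is gone

    // Fixed order: lease first, so GC is held back before we select.
    let mut state = State {
        seqno: SeqNo(3),
        batches: vec![(SeqNo(3), "b3")],
        leases: vec![],
    };
    let _lease = state.lease_seqno(); // lease pins seqno 3 first
    state.advance();
    state.gc(); // held back by the outstanding lease at seqno 3
    assert_eq!(state.batches, vec![(SeqNo(3), "b3")]);
}
```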

Motivation

Incident response - a race here could cause an unexpected read-time halt.

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@bkirwi bkirwi force-pushed the lease-fix branch 2 times, most recently from 83ff452 to 61ab3dc Compare March 19, 2026 18:23

bosconi commented Mar 19, 2026

bugbot run

cursor bot commented Mar 19, 2026

PR Summary

Medium Risk
Changes the ordering and waiting logic for snapshot/listen to ensure a seqno lease is obtained before selecting batches, which affects read correctness and could introduce new panics or latency regressions if the wait/upper logic is wrong.

Overview
Refactors persist read paths to wait for the shard upper to advance, then obtain a seqno lease before selecting snapshot/listen batches, preventing races where chosen batches/parts could be GC’d before being leased.

This introduces a shared Machine::wait_for_upper_past primitive (used by Listen::next, ReadHandle::snapshot*, and WriteHandle::wait_for_upper_past), adds RetryParameters::persist_defaults, and updates snapshot stats/parts stats to use a new Machine::unleased_snapshot helper. Metrics are renamed/retargeted from listen/snapshot-specific watch counters to generic wait-for-upper counters.
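The behavior of a "wait for the upper to advance past a frontier" primitive can be sketched with a synchronous toy (the real `wait_for_upper_past` is async and driven by a state watch; the signature and names here are illustrative only):

```rust
// Toy, synchronous stand-in for waiting until the shard upper advances
// past a frontier. Each `next()` models observing a new version of the
// shard state; the real primitive awaits state-change notifications.
fn wait_for_upper_past(upper_updates: &mut impl Iterator<Item = u64>, frontier: u64) -> u64 {
    loop {
        let upper = upper_updates
            .next()
            .expect("the upper eventually advances");
        if upper > frontier {
            return upper; // the upper is now strictly past the frontier
        }
    }
}

fn main() {
    // Waiting past frontier 7 over successive uppers 3, 5, 8, 12 stops at 8.
    let mut updates = vec![3u64, 5, 8, 12].into_iter();
    assert_eq!(wait_for_upper_past(&mut updates, 7), 8);
}
```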

Written by Cursor Bugbot for commit 61ab3dc.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


@bkirwi bkirwi marked this pull request as ready for review March 19, 2026 18:56
@bkirwi bkirwi requested a review from a team as a code owner March 19, 2026 18:56
Contributor

@mtabebe mtabebe left a comment


This change makes sense to me given our discussions. The key thing is the invariant that the seqno hold is taken before the actual batch read.

It also fixes Jan's test. I think we should consider merging Jan's repro as well with this change, so we have the test.

I don't know that I should be the approver; maybe we should wait for @teskje.


```rust
fn lease_batch_parts(
    &mut self,
    lease: Lease,
```
Contributor


Cool, this enforces that we actually have the lease through the contract of the API.
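A hedged sketch of that "contract of the API" point (hypothetical types and signatures, not the real Persist code): if the only way to mint a `Lease` is `lease_seqno`, then a function that takes a `Lease` by value cannot be called by a caller that has not already obtained one.

```rust
// Hypothetical stand-ins: a `Lease` that can only be minted by
// `lease_seqno`, so any function taking `Lease` by value is statically
// guaranteed its caller acquired the lease first.
struct Lease(u64);

fn lease_seqno(current_seqno: u64) -> Lease {
    Lease(current_seqno)
}

// Analogous to `lease_batch_parts`: the `Lease` parameter encodes the
// "lease before batch" invariant in the type system.
fn lease_batch_parts(lease: Lease, parts: Vec<&'static str>) -> (u64, Vec<&'static str>) {
    (lease.0, parts)
}

fn main() {
    let lease = lease_seqno(42); // must happen first; no other way to get a Lease
    let (held_at, parts) = lease_batch_parts(lease, vec!["part0"]);
    assert_eq!(held_at, 42);
    assert_eq!(parts, vec!["part0"]);
}
```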

```rust
match tokio::time::timeout(min_elapsed, next_batch).await {
    Ok(batch) => break batch,
    Err(_elapsed) => {
        self.handle.maybe_downgrade_since(&self.since).await;
```
Contributor


Is it intentional to drop the maybe_downgrade_since here, in this retry loop?

I think it does make sense because these are disjoint concepts. We should just wait for the lease, not do anything with the since. Just checking...

Contributor Author


Yeah, but I see why it's confusing! This loop was a workaround for an issue in an earlier version of the code, where we only relaxed any seqno holds at the same time as we downgraded the since, so we had to time out calls like this and insert calls that would be otherwise noops. As of a couple months ago, we downgrade the seqno in the background thread, so we do not need this sort of noop call. (You can see that in the latest version of this method, this only updates some metadata and doesn't trigger any actual work.)

```rust
    as_of,
    &mut watch,
    None,
    &self.applier.metrics.retries.snapshot,
```
Contributor


Is it meaningful to relabel this metric as unleased_snapshot?

Contributor Author


I don't have a case in mind where I'd want to break these down separately, but it's definitely possible if there's a use-case for it!

@bkirwi bkirwi added the release-blocker Critical issue that should block *any* release if not fixed label Mar 19, 2026
@bkirwi bkirwi requested review from DAlperin, pH14 and teskje March 19, 2026 20:58
Contributor

@teskje teskje left a comment


I'm not sure that I'm a more useful reviewer than Michael, but this makes sense to me, fwiw!

```rust
if !logged_at_info && start.elapsed() >= Duration::from_millis(1024) {
    logged_at_info = true;
    info!(
        "snapshot {} {} as of {:?} not yet available for {} upper {:?}",
```
Contributor


Looks like we lost these logs. Is that fine? I think they have been useful once or twice for me in the past, when debugging why things hang.

Contributor Author


Yeah, fair enough - let me see what I can do!

Contributor Author


I've restored this log, but parameterized to make it make sense in this slightly more generic context. (Though I've hacked it up to only log at info for snapshots, since that's the old behaviour and I think it might be a bit noisy otherwise.)

I also took a second pass in general to try and make sure the behaviour was as 1:1 with the old code as possible, except of course for the stuff we're trying to improve. :) Details in the last commit.

Comment on lines +305 to +310
```rust
let lease = self.handle.lease_seqno().await;
let batch = match self
    .handle
    .machine
    .applier
    .next_listen_batch(&self.frontier)
```
Member


Took me a minute to convince myself this can't race in a meaningful way. We acquire a lease for whatever version of state we are at. If the state advances between acquiring the lease and reading the next batch, then we get a batch at a new version of state. But due to GC's handling of seqno_since, this lease still serves as a safe lower bound. The only risk is holding GC back too far. I can imagine a version of next_listen_batch that atomically provides a lease at its current state version, but this is fine.

Contributor


I also got tripped up here going through hypothetical races with state advancing. Maybe worth adding a safety argument inline for why this works.

Contributor Author


Yeah, that's the trick - the seqno hold protects all future versions of state as well, so it's okay to non-atomically grab a lease and check the state as long as the lease happens first. (And it would be hard to do atomically in any case, since the state is process-global and lots of other handles might be updating it concurrently.)

I've enhanced the comment on lease_seqno with the reasoning here to make that more clear to future readers!
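The safety argument in this thread, that a lease taken at seqno S lower-bounds seqno_since and therefore protects every state version at or after S, can be checked in toy terms (a hypothetical model, not Persist's actual GC logic):

```rust
// Toy check of the invariant: the GC floor is the minimum outstanding
// lease, so a lease at seqno S protects batches referenced by any later
// state version S' >= S, even though the lease and the batch read are
// not atomic with each other.
fn gc_floor(leases: &[u64], current_seqno: u64) -> u64 {
    leases.iter().copied().min().unwrap_or(current_seqno)
}

fn main() {
    let lease_at = 10; // lease taken at seqno 10...
    let later_seqno = 13; // ...state advances before we read the next batch
    let floor = gc_floor(&[lease_at], later_seqno);
    // Any batch referenced at seqno >= floor survives, including batches
    // from the newer state version the reader actually observes.
    assert_eq!(floor, 10);
    assert!(floor <= later_seqno);
}
```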

Contributor

@pH14 pH14 left a comment


Okay, I've been staring at this for a while; the race diagnosis and the switch to lease-then-get-batches behavior make sense to me. I also like the simplification of wait_for_upper_past over the previous model.

I'm not sure if our tests are set up for this at all, but is there any way to write a regression test for this?

bkirwi added 2 commits March 20, 2026 11:41
This recovers some logging, and also restores some other minor behaviour
to its state before this PR. (Including that the retry policy for the
write handle used the listener's retry params.)

bkirwi commented Mar 20, 2026

I'm not sure if our tests are set up for this at all, but is there any way to write a regression test for this?

@teskje wrote a reproducer for this, though it involves adding some targeted sleeps and isn't something we can merge directly. I've confirmed that it passes on this PR, though. And I hope to follow up with a mergeable version of it when I have a moment...

@bkirwi bkirwi enabled auto-merge (squash) March 20, 2026 15:51

bkirwi commented Mar 20, 2026

Alright, thank you all for the review!

I'll get this merged so we can pull it in for next week's release.

@bkirwi bkirwi merged commit b33ffcb into MaterializeInc:main Mar 20, 2026
127 checks passed
DAlperin pushed a commit that referenced this pull request Mar 20, 2026
