persist: Make sure to obtain a lease before selecting a batch #35554
bkirwi merged 2 commits into MaterializeInc:main
Conversation
Force-pushed from 83ff452 to 61ab3dc
bugbot run
PR Summary (Medium Risk). Written by Cursor Bugbot for commit 61ab3dc.
mtabebe left a comment
This change does make sense to me given our discussions, and the key invariant, that the seqno hold is taken before the actual batch read, makes sense.
It also fixes Jan's test. I think we should consider merging in Jan's repro along with this change, so we have the test.
I don't know that I should be the approver; maybe we should wait for @teskje.
```rust
fn lease_batch_parts(
    &mut self,
    lease: Lease,
```
Cool, this enforces that we actually have the lease through the contract of the API.
```rust
match tokio::time::timeout(min_elapsed, next_batch).await {
    Ok(batch) => break batch,
    Err(_elapsed) => {
        self.handle.maybe_downgrade_since(&self.since).await;
```
Is it intentional to drop the maybe_downgrade_since here, in this retry loop?
I think it does make sense, because these are disjoint concepts: we should just wait for the lease, not do anything with the since. Just checking...
Yeah, but I see why it's confusing! This loop was a workaround for an issue in an earlier version of the code: we only relaxed seqno holds at the same time as we downgraded the since, so we had to time out calls like this and insert calls that would otherwise be noops. As of a couple of months ago, we downgrade the seqno in a background thread, so we no longer need this sort of noop call. (You can see that in the latest version of this method, it only updates some metadata and doesn't trigger any actual work.)
```rust
as_of,
&mut watch,
None,
&self.applier.metrics.retries.snapshot,
```
Is it meaningful to relabel this metric as unleased_snapshot?
I don't have a case in mind where I'd want to break these down separately, but it's definitely possible if there's a use-case for it!
teskje left a comment
I'm not sure that I'm a more useful reviewer than Michael, but this makes sense to me, fwiw!
```rust
if !logged_at_info && start.elapsed() >= Duration::from_millis(1024) {
    logged_at_info = true;
    info!(
        "snapshot {} {} as of {:?} not yet available for {} upper {:?}",
```
Looks like we lost these logs. Is that fine? I think they have been useful once or twice for me in the past, when debugging why things hang.
Yeah, fair enough - let me see what I can do!
I've restored this log, but parameterized it so it makes sense in this slightly more generic context. (Though I've hacked it up to only log at info for snapshots, since that's the old behaviour, and I think it might be a bit noisy otherwise.)
I also took a second pass in general to try and make sure the behaviour was as 1:1 with the old code as possible, except of course for the stuff we're trying to improve. :) Details in the last commit.
```rust
let lease = self.handle.lease_seqno().await;
let batch = match self
    .handle
    .machine
    .applier
    .next_listen_batch(&self.frontier)
```
Took me a minute to convince myself this can't race in a meaningful way. We acquire a lease for whatever version of state we are at. If the state advances between acquiring the lease and reading the next batch, then we get a batch at a new version of state, but due to GC's handling of seqno_since, this lease still serves as a safe lower bound. The only risk is holding GC back too far. I can imagine a version of next_listen_batch that atomically provides a lease at its current state version, but this is fine.
I also got tripped up here going through hypothetical races with state advancing. Maybe worth a safety argument inline for why this works.
Yeah, that's the trick - the seqno hold protects all future versions of state as well, so it's okay to non-atomically grab a lease and check the state as long as the lease happens first. (And it would be hard to do atomically in any case, since the state is process-global and lots of other handles might be updating it concurrently.)
I've enhanced the comment on lease_seqno with the reasoning here to make that more clear to future readers!
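The safety argument above can be sketched with a tiny standalone model. Everything below (`SeqNo`, `State`, the method names, and the GC rule) is invented for illustration and is not persist's real API; the one property it models is that GC's floor may only advance up to the minimum outstanding lease, so a lease taken at seqno N also covers every later seqno, while a batch selected before any lease is taken can be collected out from under the reader.

```rust
// Hypothetical, simplified model of seqno leases and GC. All names here
// are invented for illustration; they are not persist's actual types.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct SeqNo(u64);

struct State {
    current: SeqNo,     // the latest version of state
    leases: Vec<SeqNo>, // outstanding seqno holds
    gc_floor: SeqNo,    // everything below this has been collected
}

impl State {
    fn lease_seqno(&mut self) -> SeqNo {
        // A lease pins the current seqno; GC must not pass it.
        let lease = self.current;
        self.leases.push(lease);
        lease
    }
    fn next_batch(&self) -> SeqNo {
        // Selecting a batch reads the *current* version of state.
        self.current
    }
    fn advance(&mut self) {
        self.current = SeqNo(self.current.0 + 1);
    }
    fn run_gc(&mut self) {
        // GC's floor is held back by the oldest outstanding lease.
        let min_lease = self.leases.iter().copied().min().unwrap_or(self.current);
        self.gc_floor = self.gc_floor.max(min_lease);
    }
    fn is_readable(&self, seqno: SeqNo) -> bool {
        seqno >= self.gc_floor
    }
}

fn main() {
    // Correct order: lease first, then select the batch. Even if state
    // advances in between, the batch is at a seqno >= the leased one,
    // so the lease still serves as a safe lower bound.
    let mut s = State { current: SeqNo(5), leases: vec![], gc_floor: SeqNo(0) };
    let _lease = s.lease_seqno(); // pins seqno 5
    s.advance();                  // state moves on concurrently
    let batch = s.next_batch();   // batch at seqno 6
    s.run_gc();                   // gc_floor capped at 5 by the lease
    assert!(s.is_readable(batch));

    // Buggy order: select the batch, then lease. GC can reclaim the
    // batch before the (too-late) lease is taken.
    let mut s = State { current: SeqNo(5), leases: vec![], gc_floor: SeqNo(0) };
    let batch = s.next_batch();   // batch at seqno 5
    s.advance();                  // state advances to 6
    s.run_gc();                   // no lease held: gc_floor jumps to 6
    let _lease = s.lease_seqno(); // pins 6, not 5: too late
    assert!(!s.is_readable(batch));
}
```

In this toy model the non-atomic lease-then-read sequence is safe for exactly the reason stated in the thread: the hold protects the leased seqno and everything after it, so whatever batch the subsequent read selects is covered.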
pH14 left a comment
Okay, been staring at this for a while; the race diagnosis + switch to lease-then-get-batches behavior change here makes sense to me. Also I like the simplification of wait_for_upper_past over the previous model.
I'm not sure if our tests are set up for this at all, but is there any way to write a regression test for this?
This recovers some logging, and also restores some other minor behaviour to its state before this PR. (Including that the retry policy for the write handle used the listener's retry params.)
@teskje wrote a reproducer for this, though it involves adding some targeted sleeps and isn't something we can merge directly. I've confirmed that it passes on this PR, though. And I hope to follow up with a mergeable version of it when I have a moment...
Alright, thank you all for the review! I'll get this merged so we can pull it in for next week's release.
A "seqno lease" is the tool Persist uses internally to prevent garbage collection of a batch that a reader is still processing. It's important that we obtain the lease _before_ we choose the batch to return, to avoid a race where the state changes between the batch being selected and the lease being taken. Unfortunately, callers did this in the wrong order - chose a batch and then obtained a lease for it.
This may have been exacerbated by the recent-ish #34590, which allows more aggressive seqno downgrades to avoid leaks.
### Motivation
Incident response - a race here could cause an unexpected read-time halt.