
[sinks] Limit the number of retries on retriable transaction errors #16374

Merged — 3 commits merged into MaterializeInc:main on Dec 1, 2022

Conversation

@bkirwi (Contributor) commented on Nov 29, 2022

The design philosophy for sinks is that, on persistent errors, the sink should eventually halt. This is especially important now that we track the status of the sink, so errors are reported promptly.

It's possible for Kafka to be in a state where every transactional operation fails with a retriable error, and we currently retry those forever. Instead, this PR puts a smallish retry limit on the two relevant operations. I've also converted these back to a more functional style, since otherwise it'd be tedious to track the previous error across loop iterations.
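For illustration, the bounded-retry shape might look roughly like the following self-contained sketch. The names are illustrative, not the PR's actual mz_ore-based code:

```rust
// A hedged sketch of a bounded retry: run an async operation up to
// `max_tries` times, remembering the most recent error so the sink can
// halt with a useful status message once the budget is exhausted.
async fn with_bounded_retries<T, E, F, Fut>(max_tries: usize, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..max_tries {
        match op().await {
            Ok(t) => return Ok(t),
            // Track the previous error across iterations; this is the
            // bookkeeping that the functional style avoids spelling out.
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("max_tries must be > 0"))
}
```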

Motivation

Helps address a known bug: #16268 (comment)


@bkirwi force-pushed the sink-error-retries branch 3 times, most recently from 60106ba to 02ef484, on November 29, 2022 at 22:44
@benesch (Member) commented on Nov 30, 2022

Seeing the nested result in here reminded me that I've wanted to teach Retry about the difference between retryable and non-retryable errors for a while. I pushed up a commit to do that. WDYT?

Then once I was in here I poked a bit, and couldn't figure out why we bother aborting the transaction when we're just going to halt anyway. So I pushed up a simplification on that front. This commit is definitely take-it-or-leave-it.

Hope this wasn't too surprising! This is usually easier (for me) than opening a PR against your PR.

@bkirwi (Contributor, Author) commented on Nov 30, 2022

> Seeing the nested result in here reminded me that I've wanted to teach Retry about the difference between retryable and non-retryable errors for a while. I pushed up a commit to do that. WDYT?

Ambivalent! I felt like non-retryable errors were easy enough to express with the old API, and no more verbose, so it's not obvious to me that the helper is paying for itself here. But I don't object if you think this will be useful for other callers / we think the nested results could be tricky to reason about... which as the author of those nested results I'm definitely not the right person to have an opinion on. 😅

> Then once I was in here I poked a bit, and couldn't figure out why we bother aborting the transaction when we're just going to halt anyway. So I pushed up a simplification on that front. This commit is definitely take-it-or-leave-it.

Aborting the transaction is kind to downstream consumers, since it gives them permission to skip over the messages we've published but won't commit. (Otherwise, they can't make progress until a new version of the sink comes up to fence them out, or we hit the default-15m transaction timeout.) Since we're usually the only publisher to our output topic -- though we can't and don't enforce that -- the benefit is a bit marginal.

I think it'd be classy to at least attempt to abort, though maybe it's not worth wrapping retries around?
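To make the consumer-side effect concrete, here's a minimal sketch using the rust-rdkafka crate (the group id is illustrative): a consumer configured this way will skip messages from aborted transactions, but cannot read past a transaction that is still open.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::StreamConsumer;

// Build a consumer that only sees committed data. Until our transaction
// is committed or aborted, such a consumer cannot advance past the
// messages we've already published into the topic.
fn read_committed_consumer(brokers: &str) -> StreamConsumer {
    ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .set("group.id", "example-group") // hypothetical group id
        .set("isolation.level", "read_committed")
        .create()
        .expect("consumer creation failed")
}
```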

@bkirwi (Contributor, Author) commented on Nov 30, 2022

> Hope this wasn't too surprising! This is usually easier (for me) than opening a PR against your PR.

Not surprising! Though now that it's happened I realize I don't know how it usually works... should I treat your changes like my commits, and feel free to e.g. rewrite or rebase them out? Do I do code review like it was a full PR? Who usually e.g. fixes that lint error?

@benesch (Member) commented on Dec 1, 2022

> if you think this will be useful for other callers / we think the nested results could be tricky to reason about... which as the author of those nested results I'm definitely not the right person to have an opinion on.

Yes on both counts: there's a place in testdrive where they'll be very helpful. Your brain must work better than mine because I can't keep the nested results straight; my brain reads Ok(Err(_)) as a retryable error ("an okay error") and Err(_) as a fatal error, but that's exactly backwards.
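For comparison, a sketch of the two shapes. With nested results the closure returns `Result<Result<T, E>, E>`, where `Ok(Err(e))` means "retryable" and `Err(e)` means "fatal"; the commit replaces that with a named status. The variant names follow the commit message quoted further down, but the details here are illustrative, not mz_ore's exact code:

```rust
// A named trivariate retry status makes the intent explicit where
// nested results are easy to read backwards.
enum RetryResult<T, E> {
    Ok(T),           // success: stop retrying and return Ok to the caller
    RetryableErr(E), // retry, subject to the retry configuration
    FatalErr(E),     // stop immediately and return Err to the caller
}
```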

> should I treat your changes like my commits, and feel free to e.g. rewrite or rebase them out?

Yes!

> Aborting the transaction is kind to downstream consumers, since it gives them permission to skip over the messages we've published but won't commit. (Otherwise, they can't make progress until a new version of the sink comes up to fence them out, or we hit the default-15m transaction timeout.) Since we're usually the only publisher to our output topic -- though we can't and don't enforce that -- the benefit is a bit marginal.

Ah! That is context I did not have—but also halting, restarting, and fencing out the old producer seems approximately just as expedient? No strong feels, though I usually bias for designs that don't distinguish between hard failures and graceful failures, and there's something appealing about having the "graceful halt" leave the transaction in the same aborted state that an OOM or a panic would. Gets us more coverage of that code path.

> Do I do code review like it was a full PR?

Tag team! So you review my commits and I (or someone else) review yours, and when we have mutual signoff we're good. Part of what makes this work is that we don't require approvals in this codebase, so we're not fighting with GitHub policies about getting the right numbers of reviewers.

> Who usually e.g. fixes that lint error?

I'd say usually whoever gets to it first!

@bkirwi (Contributor, Author) commented on Dec 1, 2022

Great, thanks for the context!

On balance I still think it's worth attempting the abort, since it's a cheap way to improve the liveness of the sink, though if that fails it makes sense to punt the work to the eventually-restarted storaged like you suggest. I'll work that up and then get this out of draft mode.
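The shape being discussed (attempt the abort, but halt either way) might look roughly like this sketch using rust-rdkafka; `halt_on_fatal_error` and the `panic!` are stand-ins for the sink's real halt machinery:

```rust
use std::time::Duration;

use rdkafka::producer::{DefaultProducerContext, Producer, ThreadedProducer};

// Hypothetical helper: best-effort abort, then halt regardless.
fn halt_on_fatal_error(producer: &ThreadedProducer<DefaultProducerContext>, err: &str) -> ! {
    // Aborting lets read_committed consumers skip our unfinished
    // transaction instead of waiting for fencing or the transaction
    // timeout. If the abort fails, the restarted sink will fence out
    // this producer eventually anyway, so we don't retry here.
    if let Err(e) = producer.abort_transaction(Duration::from_secs(5)) {
        eprintln!("failed to abort transaction before halting: {e}");
    }
    panic!("halting sink: {err}"); // stand-in for the real halt machinery
}
```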

bkirwi and others added 3 commits on December 1, 2022 at 11:37
The design philosophy for sinks is that, on persistent errors, the sink
should eventually halt. This is especially important now that we track
the status of the sink, so errors are reported promptly.

Anyways, now we have a smallish retry limit on each of the two
operations. I've also converted these back to a more functional style,
since otherwise it'd be tedious to track the previous error across loop
iterations.
Not all errors are retryable. This commit teaches mz_ore::retry::Retry
about trivariate retry statuses: Ok, RetryableErr, and FatalErr. Both Ok
and FatalErr immediately end the retry operation, and return Ok or Err
to the caller, respectively, while only a RetryableErr is retried
according to the retry operation configuration.
And also leave a comment, since it's not obvious why it's worth the
trouble to do at all.
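For illustration, a minimal driver for the trivariate status described in the second commit message could look like the following; it builds on the `RetryResult` sketch earlier in this thread and omits the backoff and configuration handling that the real mz_ore::retry::Retry provides:

```rust
// Interpret the trivariate status: Ok and FatalErr end the operation
// immediately, while RetryableErr consumes retry budget.
fn retry_with_limit<T, E>(
    max_tries: usize,
    mut op: impl FnMut() -> RetryResult<T, E>,
) -> Result<T, E> {
    let mut tries = 0;
    loop {
        tries += 1;
        match op() {
            RetryResult::Ok(t) => return Ok(t),
            RetryResult::FatalErr(e) => return Err(e),
            // Out of budget: surface the last retryable error.
            RetryResult::RetryableErr(e) if tries >= max_tries => return Err(e),
            RetryResult::RetryableErr(_) => continue, // back off and try again
        }
    }
}
```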
@benesch (Member) commented on Dec 1, 2022

> On balance I still think it's worth attempting the abort, since it's a cheap way to improve the liveness of the sink, though if that fails it makes sense to punt the work to the eventually-restarted storaged like you suggest. I'll work that up and then get this out of draft mode.

Ack, works for me.

@bkirwi marked this pull request as ready for review on December 1, 2022 at 21:32
@benesch (Member) left a review comment


Brilliant!

@bkirwi (Contributor, Author) commented on Dec 1, 2022

Thanks for looking it over! And of course +1 on the retry-helper changes from me as well.

@bkirwi merged commit f33b7a0 into MaterializeInc:main on Dec 1, 2022
@bkirwi deleted the sink-error-retries branch on December 2, 2022 at 22:31