New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sinks] Limit the number of retries on retriable transaction errors #16374
Conversation
60106ba
to
02ef484
Compare
Seeing the nested result in here reminded me that I've wanted to teach Then once I was in here I poked a bit, and couldn't figure out why we bother aborting the transaction when we're just going to halt anyway. So I pushed up a simplification on that front. This commit is definitely take-it-or-leave-it. Hope this wasn't too surprising! This is usually easier (for me) than opening a PR against your PR. |
Ambivalent! I felt like non-retryable errors were easy enough to express with the old API and not more verbose, so it's not obvious to me that the helper is paying for itself here. But I don't object if you think this will be useful for other callers / we think the nested results could be tricky to reason about... which as the author of those nested results I'm definitely not the right person to have an opinion on. 😅
Aborting the transaction is kind to downstream consumers, since it gives them permission to skip over the messages we've published but won't commit. (Otherwise, they can't make progress until a new version of the sink comes up to fence them out, or we hit the default-15m transaction timeout.) Since we're usually the only publisher to our output topic -- though we can't and don't enforce that -- the benefit is a bit marginal. I think it'd be classy to at least attempt to abort, though maybe it's not worth wrapping retries around? |
Not surprising! Though now that it's happened I realize I don't now how it usually works... should I treat your changes like my commits, and feel free to eg. rewrite or rebase them out? Do I do code review like it was a full PR? Who usually eg. fixes that lint error? |
Yes on both counts: there's a place in testdrive where they'll be very helpful. Your brain must work better than mine because I can't keep the nested results straight; my brain reads
Yes!
Ah! That is context I did not have—but also halting, restarting, and fencing out the old producer seems approximately just as expedient? No strong feels, though I usually bias for designs that don't distinguish between hard failures and graceful failures, and there's something appealing about having the "graceful halt" leave the transaction in the same aborted state that an OOM or a panic would. Gets us more coverage of that code path.
Tag team! So you review my commits and I (or someone else) review yours, and when we have mutual signoff we're good. Part of what makes this work is that we don't require approvals in this codebase, so we're not fighting with GitHub policies about getting the right numbers of reviewers.
I'd say usually whoever gets to it first! |
Great, thanks for the context! I think on balance I still think it's worth attempting the abort, since it's a cheap way to improve the liveness of the sink, though if that fails it makes sense to punt the work to the eventually-restarted storaged like you suggest. I'll work that up and then get this out of draft mode. |
The design philosophy for sinks is that, on persistent errors, the sink should eventually halt. This is especially important now that we track the status of the sink, so errors are reported promptly. Anyways, now we have a smallish retry limit on each of the two operations. I've also converted these back to a more functional style, since otherwise it'd be tedious to track the previous error across loop iterations.
Not all errors are retryable. This commit teaches mz_ore::retry::Retry about trivariate retry statuses: Ok, RetryableErr, and FatalErr. Both Ok and FatalErr immediately end the retry operation, and return Ok or Err to the caller, respectively, while onlya RetryableErr is retried according to the retry operation configuration.
And also leave a comment, since it's not obvious why it's worth the trouble to do at all.
551dcc1
to
36509db
Compare
Ack, works for me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brilliant!
Thanks for looking it over! And of course +1 on the retry-helper changes from me as well. |
The design philosophy for sinks is that, on persistent errors, the sink should eventually halt. This is especially important now that we track the status of the sink, so errors are reported promptly.
It's possible for Kafka to be in a state where every transactional operation will fail with a retriable error, and we retry those forever. Instead, this PR converts the two relevant operations to have a smallish retry limit. I've also converted these back to a more functional style, since otherwise it'd be tedious to track the previous error across loop iterations.
Motivation
Helps address a known bug: #16268 (comment)
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way) and therefore is tagged with aT-proto
label.