
[sinks] Limit the number of retries on retriable transaction errors #16374

Merged — 3 commits merged into MaterializeInc:main on Dec 1, 2022

Conversation

@bkirwi (Contributor) commented on Nov 29, 2022

The design philosophy for sinks is that, on persistent errors, the sink should eventually halt. This is especially important now that we track the status of the sink, so errors are reported promptly.

It's possible for Kafka to be in a state where every transactional operation fails with a retriable error, and we currently retry those forever. Instead, this PR puts a smallish retry limit on the two relevant operations. I've also converted these back to a more functional style, since otherwise it'd be tedious to track the previous error across loop iterations.
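For illustration, the bounded-retry shape might look roughly like the following self-contained sketch. The names are illustrative, not the PR's actual mz_ore-based code:

```rust
// A hedged sketch of a bounded retry: run an async operation up to
// `max_tries` times, remembering the most recent error so the sink can
// halt with a useful status message once the budget is exhausted.
async fn with_bounded_retries<T, E, F, Fut>(max_tries: usize, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..max_tries {
        match op().await {
            Ok(t) => return Ok(t),
            // Track the previous error across iterations; this is the
            // bookkeeping that the functional style avoids spelling out.
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("max_tries must be > 0"))
}
```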

Motivation

Helps address a known bug: #16268 (comment)


@bkirwi force-pushed the sink-error-retries branch 3 times, most recently from 60106ba to 02ef484, on November 29, 2022 at 22:44
@benesch (Member) commented on Nov 30, 2022

Seeing the nested result in here reminded me that I've wanted to teach Retry about the difference between retryable and non-retryable errors for a while. I pushed up a commit to do that. WDYT?

Then once I was in here I poked a bit, and couldn't figure out why we bother aborting the transaction when we're just going to halt anyway. So I pushed up a simplification on that front. This commit is definitely take-it-or-leave-it.

Hope this wasn't too surprising! This is usually easier (for me) than opening a PR against your PR.

@bkirwi (Contributor, Author) commented on Nov 30, 2022

> Seeing the nested result in here reminded me that I've wanted to teach Retry about the difference between retryable and non-retryable errors for a while. I pushed up a commit to do that. WDYT?

Ambivalent! I felt like non-retryable errors were easy enough to express with the old API, and no more verbose, so it's not obvious to me that the helper is paying for itself here. But I don't object if you think this will be useful for other callers / we think the nested results could be tricky to reason about... which as the author of those nested results I'm definitely not the right person to have an opinion on. 😅

> Then once I was in here I poked a bit, and couldn't figure out why we bother aborting the transaction when we're just going to halt anyway. So I pushed up a simplification on that front. This commit is definitely take-it-or-leave-it.

Aborting the transaction is kind to downstream consumers, since it gives them permission to skip over the messages we've published but won't commit. (Otherwise, they can't make progress until a new version of the sink comes up to fence them out, or we hit the default-15m transaction timeout.) Since we're usually the only publisher to our output topic -- though we can't and don't enforce that -- the benefit is a bit marginal.

I think it'd be classy to at least attempt to abort, though maybe it's not worth wrapping retries around?
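To make the consumer-side effect concrete, here's a minimal sketch using the rust-rdkafka crate (the group id is illustrative): a consumer configured this way will skip messages from aborted transactions, but cannot read past a transaction that is still open.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::StreamConsumer;

// Build a consumer that only sees committed data. Until our transaction
// is committed or aborted, such a consumer cannot advance past the
// messages we've already published into the topic.
fn read_committed_consumer(brokers: &str) -> StreamConsumer {
    ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .set("group.id", "example-group") // hypothetical group id
        .set("isolation.level", "read_committed")
        .create()
        .expect("consumer creation failed")
}
```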

@bkirwi (Contributor, Author) commented on Nov 30, 2022

> Hope this wasn't too surprising! This is usually easier (for me) than opening a PR against your PR.

Not surprising! Though now that it's happened I realize I don't know how it usually works... should I treat your changes like my commits, and feel free to e.g. rewrite or rebase them out? Do I do code review like it was a full PR? Who usually e.g. fixes that lint error?

@benesch (Member) commented on Dec 1, 2022

> if you think this will be useful for other callers / we think the nested results could be tricky to reason about... which as the author of those nested results I'm definitely not the right person to have an opinion on.

Yes on both counts: there's a place in testdrive where they'll be very helpful. Your brain must work better than mine because I can't keep the nested results straight; my brain reads Ok(Err(_)) as a retryable error ("an okay error") and Err(_) as a fatal error, but that's exactly backwards.
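For comparison, a sketch of the two shapes. With nested results the closure returns `Result<Result<T, E>, E>`, where `Ok(Err(e))` means "retryable" and `Err(e)` means "fatal"; the commit replaces that with a named status. The variant names follow the commit message quoted further down, but the details here are illustrative, not mz_ore's exact code:

```rust
// A named trivariate retry status makes the intent explicit where
// nested results are easy to read backwards.
enum RetryResult<T, E> {
    Ok(T),           // success: stop retrying and return Ok to the caller
    RetryableErr(E), // retry, subject to the retry configuration
    FatalErr(E),     // stop immediately and return Err to the caller
}
```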

> should I treat your changes like my commits, and feel free to e.g. rewrite or rebase them out?

Yes!

> Aborting the transaction is kind to downstream consumers, since it gives them permission to skip over the messages we've published but won't commit. (Otherwise, they can't make progress until a new version of the sink comes up to fence them out, or we hit the default-15m transaction timeout.) Since we're usually the only publisher to our output topic -- though we can't and don't enforce that -- the benefit is a bit marginal.

Ah! That is context I did not have—but also halting, restarting, and fencing out the old producer seems approximately just as expedient? No strong feels, though I usually bias for designs that don't distinguish between hard failures and graceful failures, and there's something appealing about having the "graceful halt" leave the transaction in the same aborted state that an OOM or a panic would. Gets us more coverage of that code path.

> Do I do code review like it was a full PR?

Tag team! So you review my commits and I (or someone else) review yours, and when we have mutual signoff we're good. Part of what makes this work is that we don't require approvals in this codebase, so we're not fighting with GitHub policies about getting the right numbers of reviewers.

> Who usually e.g. fixes that lint error?

I'd say usually whoever gets to it first!

@bkirwi (Contributor, Author) commented on Dec 1, 2022

Great, thanks for the context!

On balance I still think it's worth attempting the abort, since it's a cheap way to improve the liveness of the sink, though if that fails it makes sense to punt the work to the eventually-restarted storaged like you suggest. I'll work that up and then get this out of draft mode.
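The shape being discussed (attempt the abort, but halt either way) might look roughly like this sketch using rust-rdkafka; `halt_on_fatal_error` and the `panic!` are stand-ins for the sink's real halt machinery:

```rust
use std::time::Duration;

use rdkafka::producer::{DefaultProducerContext, Producer, ThreadedProducer};

// Hypothetical helper: best-effort abort, then halt regardless.
fn halt_on_fatal_error(producer: &ThreadedProducer<DefaultProducerContext>, err: &str) -> ! {
    // Aborting lets read_committed consumers skip our unfinished
    // transaction instead of waiting for fencing or the transaction
    // timeout. If the abort fails, the restarted sink will fence out
    // this producer eventually anyway, so we don't retry here.
    if let Err(e) = producer.abort_transaction(Duration::from_secs(5)) {
        eprintln!("failed to abort transaction before halting: {e}");
    }
    panic!("halting sink: {err}"); // stand-in for the real halt machinery
}
```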

bkirwi and others added 3 commits on December 1, 2022 at 11:37
The design philosophy for sinks is that, on persistent errors, the sink
should eventually halt. This is especially important now that we track
the status of the sink, so errors are reported promptly.

Anyways, now we have a smallish retry limit on each of the two
operations. I've also converted these back to a more functional style,
since otherwise it'd be tedious to track the previous error across loop
iterations.
Not all errors are retryable. This commit teaches mz_ore::retry::Retry
about trivariate retry statuses: Ok, RetryableErr, and FatalErr. Both Ok
and FatalErr immediately end the retry operation, and return Ok or Err
to the caller, respectively, while only a RetryableErr is retried
according to the retry operation configuration.
And also leave a comment, since it's not obvious why it's worth the
trouble to do at all.
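For illustration, a minimal driver for the trivariate status described in the second commit message could look like the following; it builds on the `RetryResult` sketch earlier in this thread and omits the backoff and configuration handling that the real mz_ore::retry::Retry provides:

```rust
// Interpret the trivariate status: Ok and FatalErr end the operation
// immediately, while RetryableErr consumes retry budget.
fn retry_with_limit<T, E>(
    max_tries: usize,
    mut op: impl FnMut() -> RetryResult<T, E>,
) -> Result<T, E> {
    let mut tries = 0;
    loop {
        tries += 1;
        match op() {
            RetryResult::Ok(t) => return Ok(t),
            RetryResult::FatalErr(e) => return Err(e),
            // Out of budget: surface the last retryable error.
            RetryResult::RetryableErr(e) if tries >= max_tries => return Err(e),
            RetryResult::RetryableErr(_) => continue, // back off and try again
        }
    }
}
```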
@benesch (Member) commented on Dec 1, 2022

> On balance I still think it's worth attempting the abort, since it's a cheap way to improve the liveness of the sink, though if that fails it makes sense to punt the work to the eventually-restarted storaged like you suggest. I'll work that up and then get this out of draft mode.

Ack, works for me.

@bkirwi marked this pull request as ready for review on December 1, 2022 at 21:32
@benesch (Member) left a review comment


Brilliant!

@bkirwi (Contributor, Author) commented on Dec 1, 2022

Thanks for looking it over! And of course +1 on the retry-helper changes from me as well.

@bkirwi merged commit f33b7a0 into MaterializeInc:main on Dec 1, 2022
@bkirwi deleted the sink-error-retries branch on December 2, 2022 at 22:31