-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't retry permanent failures #634
Don't retry permanent failures #634
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained
-- commits
line 4 at r1:
Actually, I'm not sure we should encode this strictly in the code, instead lets make this a config on every retry config.
Then make what you have the default.
The reason is because some S3 store implementations will return NotFound
even though the item is there because it's eventual consistency. I is silly, but I've personally seen this and if you delay the read just a few micros it comes into existence.
I suggest make a ErrorRetryMapping
struct in the config, but super simple, maybe just:
struct ErrorRetryMapping {
not_found: bool,
other: bool,
}
We can add more special cases to this later, and the default with be { not_found: false, other: true }
.
Thoughts?
Previously, allada (Nathan (Blaise) Bruer) wrote…
Sure. |
59aac44
to
5751a5d
Compare
5751a5d
to
7c8a40e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 8 files at r2.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-config/src/stores.rs
line 597 at r2 (raw file):
/// - Data loss /// These default to retrying. #[serde(default = "bool_true")]
eeh, I've been trying to avoid setting anything that is not default to a non-default value in the config.
I'm actually thinking if maybe we should do this struct as an enum, then have it:
no_retry_errors: Option<HashSet<ErrorCode>>
The nice thing about doing it this way is ErrorCode
can easily be used in other configs in the future (if needed).
nativelink-error/src/lib.rs
line 315 at r2 (raw file):
DataLoss = 15, Unauthenticated = 16, // NOTE: Additional codes must be added to retry.rs
:-(
nativelink-util/src/retry.rs
line 25 at r2 (raw file):
use tracing::debug; pub struct ExponentialBackoff {
nit: This probably should be private now.
7c8a40e
to
1d29918
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-config/src/stores.rs
line 597 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
eeh, I've been trying to avoid setting anything that is not default to a non-default value in the config.
I'm actually thinking if maybe we should do this struct as an enum, then have it:
no_retry_errors: Option<HashSet<ErrorCode>>The nice thing about doing it this way is
ErrorCode
can easily be used in other configs in the future (if needed).
I'd like there to be a set of sensible codes that don't retry by default rather than requiring the user to specify them though, also requiring users to know error codes rather than nice strings seems a bit much.
nativelink-error/src/lib.rs
line 315 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
:-(
Can't have exhaustive match
outside of a module :'(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Local / ubuntu-22.04, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable
nativelink-config/src/stores.rs
line 597 at r2 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
I'd like there to be a set of sensible codes that don't retry by default rather than requiring the user to specify them though, also requiring users to know error codes rather than nice strings seems a bit much.
Perhaps we could rename this to do_not_retry_other
and then invert the boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Local / ubuntu-22.04, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable
nativelink-util/src/retry.rs
line 97 at r2 (raw file):
fn get_retry_config(&self) -> impl Iterator<Item = Duration> + '_ { ExponentialBackoff::new(Duration::from_millis(self.config.delay as u64))
nit: not that we support other backoff functions yet, having backoff as a configuration would be a nice to have, non-blocker and maybe something to explore at a later point in time. Exponential backoff is def most common/graceful in general scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04)
nativelink-util/src/retry.rs
line 97 at r2 (raw file):
Previously, adam-singer (Adam Singer) wrote…
nit: not that we support other backoff functions yet, having backoff as a configuration would be a nice to have, non-blocker and maybe something to explore at a later point in time. Exponential backoff is def most common/graceful in general scenarios
Yes, that was one of the motivations for moving the configuration into Retry
so it can be easily modified later without having to modify other parts of the system such as GrpcStore
or S3Store
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-config/src/stores.rs
line 597 at r2 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
Perhaps we could rename this to
do_not_retry_other
and then invert the boolean?
Oh, I am saying we should default None
to what you have defaulted now. I'm just afraid we are going to lock ourselves in as we add new implementations of stores that use this retry trait. For example, RedisStore
might have a spurious bug where the credentials are only updated on failure (sounds dumb, but I'm sure you've seen dumber bugs), to which a user would just want to retry on Unauthenticated
to get around this issue, but by default we should not do this.
Again, I don't want to block you, I just want to make sure we design our public api's with good thought.
1d29918
to
d75ec73
Compare
d75ec73
to
6c8d1f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r2, 1 of 1 files at r4, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained
The retrier always retries on error. However, some errors are permanent, for example NotFound. This means that if a request is made to get something from the CAS (or GrpcStore AC) which is not there, it keeps trying even though it's still not there. This defines the set of errors which shoudl be retried and fails otherwise.
6c8d1f6
to
b580698
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-config/src/stores.rs
line 597 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Oh, I am saying we should default
None
to what you have defaulted now. I'm just afraid we are going to lock ourselves in as we add new implementations of stores that use this retry trait. For example,RedisStore
might have a spurious bug where the credentials are only updated on failure (sounds dumb, but I'm sure you've seen dumber bugs), to which a user would just want to retry onUnauthenticated
to get around this issue, but by default we should not do this.Again, I don't want to block you, I just want to make sure we design our public api's with good thought.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r5, all commit messages.
Reviewable status: complete! 1 of 1 LGTMs obtained
Description
The retrier always retries on error. However, some errors are permanent, for example NotFound. This means that if a request is made to get something from the CAS (or GrpcStore AC) which is not there, it keeps trying even though it's still not there. This defines the set of errors which shoudl be retried and fails otherwise.
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Remote proxy getting data from the action cache now doesn't retry ad infinitum.
Checklist
bazel test //...
passes locallygit amend
see some docsThis change is