Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't retry permanent failures #634

Conversation

chrisstaite-menlo
Copy link
Collaborator

@chrisstaite-menlo chrisstaite-menlo commented Jan 25, 2024

Description

The retrier always retries on error. However, some errors are permanent, for example NotFound. This means that if a request is made to get something from the CAS (or GrpcStore AC) which is not there, it keeps trying even though it's still not there. This defines the set of errors which shoudl be retried and fails otherwise.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Remote proxy getting data from the action cache now doesn't retry ad infinitum.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

Copy link

vercel bot commented Jan 25, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nativelink-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 31, 2024 5:43pm

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained


-- commits line 4 at r1:
Actually, I'm not sure we should encode this strictly in the code, instead lets make this a config on every retry config.

Then make what you have the default.

The reason is because some S3 store implementations will return NotFound even though the item is there because it's eventual consistency. I is silly, but I've personally seen this and if you delay the read just a few micros it comes into existence.

I suggest make a ErrorRetryMapping struct in the config, but super simple, maybe just:

struct ErrorRetryMapping {
  not_found: bool,
  other: bool,
}

We can add more special cases to this later, and the default with be { not_found: false, other: true }.

Thoughts?

@chrisstaite-menlo
Copy link
Collaborator Author

-- commits line 4 at r1:

Previously, allada (Nathan (Blaise) Bruer) wrote…

Actually, I'm not sure we should encode this strictly in the code, instead lets make this a config on every retry config.

Then make what you have the default.

The reason is because some S3 store implementations will return NotFound even though the item is there because it's eventual consistency. I is silly, but I've personally seen this and if you delay the read just a few micros it comes into existence.

I suggest make a ErrorRetryMapping struct in the config, but super simple, maybe just:

struct ErrorRetryMapping {
  not_found: bool,
  other: bool,
}

We can add more special cases to this later, and the default with be { not_found: false, other: true }.

Thoughts?

Sure.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 8 files at r2.
Reviewable status: 0 of 1 LGTMs obtained


nativelink-config/src/stores.rs line 597 at r2 (raw file):

    ///  - Data loss
    /// These default to retrying.
    #[serde(default = "bool_true")]

eeh, I've been trying to avoid setting anything that is not default to a non-default value in the config.

I'm actually thinking if maybe we should do this struct as an enum, then have it:

no_retry_errors: Option<HashSet<ErrorCode>>

The nice thing about doing it this way is ErrorCode can easily be used in other configs in the future (if needed).


nativelink-error/src/lib.rs line 315 at r2 (raw file):

    DataLoss = 15,
    Unauthenticated = 16,
    // NOTE: Additional codes must be added to retry.rs

:-(


nativelink-util/src/retry.rs line 25 at r2 (raw file):

use tracing::debug;

pub struct ExponentialBackoff {

nit: This probably should be private now.

Copy link
Collaborator Author

@chrisstaite-menlo chrisstaite-menlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained


nativelink-config/src/stores.rs line 597 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

eeh, I've been trying to avoid setting anything that is not default to a non-default value in the config.

I'm actually thinking if maybe we should do this struct as an enum, then have it:

no_retry_errors: Option<HashSet<ErrorCode>>

The nice thing about doing it this way is ErrorCode can easily be used in other configs in the future (if needed).

I'd like there to be a set of sensible codes that don't retry by default rather than requiring the user to specify them though, also requiring users to know error codes rather than nice strings seems a bit much.


nativelink-error/src/lib.rs line 315 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

:-(

Can't have exhaustive match outside of a module :'(

Copy link
Collaborator Author

@chrisstaite-menlo chrisstaite-menlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Local / ubuntu-22.04, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable


nativelink-config/src/stores.rs line 597 at r2 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

I'd like there to be a set of sensible codes that don't retry by default rather than requiring the user to specify them though, also requiring users to know error codes rather than nice strings seems a bit much.

Perhaps we could rename this to do_not_retry_other and then invert the boolean?

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Local / ubuntu-22.04, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable


nativelink-util/src/retry.rs line 97 at r2 (raw file):

    fn get_retry_config(&self) -> impl Iterator<Item = Duration> + '_ {
        ExponentialBackoff::new(Duration::from_millis(self.config.delay as u64))

nit: not that we support other backoff functions yet, having backoff as a configuration would be a nice to have, non-blocker and maybe something to explore at a later point in time. Exponential backoff is def most common/graceful in general scenarios

Copy link
Collaborator Author

@chrisstaite-menlo chrisstaite-menlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04)


nativelink-util/src/retry.rs line 97 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: not that we support other backoff functions yet, having backoff as a configuration would be a nice to have, non-blocker and maybe something to explore at a later point in time. Exponential backoff is def most common/graceful in general scenarios

Yes, that was one of the motivations for moving the configuration into Retry so it can be easily modified later without having to modify other parts of the system such as GrpcStore or S3Store.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3.
Reviewable status: 0 of 1 LGTMs obtained


nativelink-config/src/stores.rs line 597 at r2 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

Perhaps we could rename this to do_not_retry_other and then invert the boolean?

Oh, I am saying we should default None to what you have defaulted now. I'm just afraid we are going to lock ourselves in as we add new implementations of stores that use this retry trait. For example, RedisStore might have a spurious bug where the credentials are only updated on failure (sounds dumb, but I'm sure you've seen dumber bugs), to which a user would just want to retry on Unauthenticated to get around this issue, but by default we should not do this.

Again, I don't want to block you, I just want to make sure we design our public api's with good thought.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 8 files at r2, 1 of 1 files at r4, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained

The retrier always retries on error.  However, some errors are permanent, for
example NotFound.  This means that if a request is made to get something from
the CAS (or GrpcStore AC) which is not there, it keeps trying even though it's
still not there.  This defines the set of errors which shoudl be retried and
fails otherwise.
Copy link
Collaborator Author

@chrisstaite-menlo chrisstaite-menlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained


nativelink-config/src/stores.rs line 597 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Oh, I am saying we should default None to what you have defaulted now. I'm just afraid we are going to lock ourselves in as we add new implementations of stores that use this retry trait. For example, RedisStore might have a spurious bug where the credentials are only updated on failure (sounds dumb, but I'm sure you've seen dumber bugs), to which a user would just want to retry on Unauthenticated to get around this issue, but by default we should not do this.

Again, I don't want to block you, I just want to make sure we design our public api's with good thought.

Done.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained

@chrisstaite-menlo chrisstaite-menlo merged commit 81b64f7 into TraceMachina:main Feb 2, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants