
Max concurrent GrpcStore streams #656

Conversation

@chrisstaite-menlo (Collaborator) commented on Feb 3, 2024

Description

Add a configuration option to limit the number of concurrent streams that are active with the upstream GrpcStore.

With this, we also introduce reconnection on error since Tonic doesn't do that.

Towards #602
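
For context, this is the kind of limiting the new option enables. A minimal sketch of the technique using tokio::sync::Semaphore (illustrative only, with a made-up limit; not necessarily the exact implementation in this change):

use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical limit; in the real configuration a value of zero means unlimited.
const MAX_CONCURRENT_STREAMS: usize = 16;

#[tokio::main]
async fn main() {
    let limiter = Arc::new(Semaphore::new(MAX_CONCURRENT_STREAMS));
    let mut handles = Vec::new();
    for i in 0..64 {
        let limiter = Arc::clone(&limiter);
        handles.push(tokio::spawn(async move {
            // Waits until a permit is free, so excess requests queue rather than
            // being dropped; the permit is released again when it is dropped.
            let _permit = limiter.acquire_owned().await.expect("semaphore closed");
            // ... the upstream gRPC request would be issued here ...
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            println!("request {i} finished");
        }));
    }
    for handle in handles {
        handle.await.expect("task panicked");
    }
}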

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Please also list any relevant details for your test configuration

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git commit --amend (see some docs)



@allada (Collaborator) left a comment

Reviewed 6 of 8 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained

a discussion (no related file):
nit: It doesn't fix the issue, it is towards that issue.



nativelink-config/src/schedulers.rs line 139 at r1 (raw file):

    /// Limit the number of simultaneous upstream requests to this many.  A
    /// value of zero is treated as unlimited.

nit: It might be good to document what happens if we exceed the limit (ie: do we backpressure/queue or drop).
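
A sketch of what the clarified doc comment could say, assuming semaphore-style queueing (excess requests wait for a free slot rather than being rejected); the field name here is illustrative, not the actual one in the config:

    /// Limit the number of simultaneous upstream requests to this many.  A
    /// value of zero is treated as unlimited.  If the limit is reached, any
    /// additional requests wait for an in-flight request to finish before
    /// being sent (backpressure); they are not rejected or dropped.
    pub max_concurrent_requests: usize,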


nativelink-config/src/stores.rs line 551 at r1 (raw file):

    /// Limit the number of simultaneous upstream requests to this many.  A
    /// value of zero is treated as unlimited.

ditto.


nativelink-scheduler/src/grpc_scheduler.rs line 76 at r1 (raw file):

    fn on_error(&self, err: &Error) {
        if err.code != Code::Internal {

nit: Maybe warn!() the error?


nativelink-scheduler/src/grpc_scheduler.rs line 76 at r1 (raw file):

    fn on_error(&self, err: &Error) {
        if err.code != Code::Internal {

Should we use some kind of map of codes to know if we should reconnect?


nativelink-store/src/grpc_store.rs line 223 at r1 (raw file):

/// This handles the permit for limiting concurrency, and also re-connecting the
/// underlying channel on error.  It depends on users reporting all errors.
struct Connection<'a> {

nit: This looks nearly identical to the one above in grpc_scheduler. Can we combine them w/ templates and put them into grpc_utils or something?
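
For a sense of what a shared helper could look like once lifted into a common module, a rough sketch based only on the quoted doc comment (names and fields are illustrative; the real grpc_utils.rs may differ):

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};
use tonic::transport::Channel;

/// Shared by GrpcStore and GrpcScheduler instead of each defining its own
/// wrapper.  Holding one keeps a concurrency slot occupied; dropping it
/// releases the slot for the next request.
pub struct Connection {
    _permit: OwnedSemaphorePermit,
    /// The channel this single request should be issued on.
    pub channel: Channel,
}

impl Connection {
    /// Waits for a free slot (backpressure) and hands back the current channel.
    pub async fn acquire(limiter: &Arc<Semaphore>, channel: &Channel) -> Self {
        let permit = limiter
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        Self {
            _permit: permit,
            channel: channel.clone(),
        }
    }
}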

@chrisstaite-menlo (Collaborator, Author) left a comment

Reviewable status: 0 of 1 LGTMs obtained


nativelink-config/src/stores.rs line 551 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

ditto.

Done.


nativelink-scheduler/src/grpc_scheduler.rs line 76 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Maybe warn!() the error?

That's the purpose of the upstream really.


nativelink-scheduler/src/grpc_scheduler.rs line 76 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Should we use some kind of map of codes to know if we should reconnect?

I don't think we need to; all of the transport errors are mapped to Internal within Tonic: https://github.com/hyperium/tonic/blob/master/tonic/src/status.rs#L376

Notably, it states that these indicate an issue with the library: https://github.com/hyperium/tonic/blob/master/tonic/src/status.rs#L282
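
In other words, the guard in the quoted snippet treats Code::Internal as a proxy for "the transport itself failed". A self-contained sketch of that logic, with placeholder Error/Code types standing in for nativelink_error's and a stubbed-out reconnect:

// Placeholders standing in for nativelink_error::{Code, Error}.
#[derive(PartialEq)]
enum Code {
    Internal,
    NotFound,
}

struct Error {
    code: Code,
}

struct GrpcScheduler;

impl GrpcScheduler {
    fn on_error(&self, err: &Error) {
        // Tonic maps transport-level failures (connection reset, GOAWAY, ...)
        // to Code::Internal, so any other code is an application-level error
        // and the channel itself is still usable.
        if err.code != Code::Internal {
            return;
        }
        self.reconnect();
    }

    fn reconnect(&self) {
        // The real code would recreate the tonic Channel here.
    }
}

fn main() {
    let scheduler = GrpcScheduler;
    scheduler.on_error(&Error { code: Code::NotFound }); // No reconnect.
    scheduler.on_error(&Error { code: Code::Internal }); // Triggers a reconnect.
}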

@allada (Collaborator) left a comment

Reviewed 4 of 7 files at r2, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained


nativelink-util/src/grpc_utils.rs line 91 at r2 (raw file):

        // new connection now.
        let mut channel_lock = self.parent.channel.lock();
        // Channel already changed, don't do it again.

nit: Can we elaborate on this a bit? It took me a while to understand. Perhaps something like:

// This ensures only the first error reconnects and the other reconnect
// attempts use the Channel that the first one recreated.

nativelink-util/src/grpc_utils.rs line 92 at r2 (raw file):

        let mut channel_lock = self.parent.channel.lock();
        // Channel already changed, don't do it again.
        if channel_lock.0 != self.channel_id {

I'm pretty sure you need to do this on all channels after they are updated:

self.channel_id = channel_lock.0

(sadly you'll need another mutex :-( )

Failing to do so will mean that after the second reconnect attempt, get_channel always returns a bad channel.

@chrisstaite-menlo (Collaborator, Author) left a comment

Reviewable status: 0 of 1 LGTMs obtained


nativelink-util/src/grpc_utils.rs line 92 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I'm pretty sure you need to do this on all channels after they are updated:

self.channel_id = channel_lock.0

(sadly you'll need another mutex :-( )

Failing to do so will mean that after the second reconnect attempt, get_channel always returns a bad channel.

No, a Connection should only be used once. I really wanted to make this easier by impl tower::Service<http::HttpBody> for Connection but that would involve adding a dependency to tower.

Essentially, we create a new Channel in the parent ConnectionManager on failure, but any existing Connection that's being used can complete (and perhaps error) by itself. This is done because in the case of GoAway the other established streams on the Channel are able to complete, but we need to terminate after that graceful disconnect has finished.
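
To make that single-use contract concrete, a self-contained sketch of the reconnect-once pattern being described (types and method names are simplified stand-ins, permit handling omitted; the real grpc_utils.rs may differ):

use std::sync::Arc;
use parking_lot::Mutex;

// Placeholder standing in for tonic::transport::Channel so the sketch is
// self-contained; the real type is cheaply cloneable in much the same way.
#[derive(Clone)]
struct Channel;

struct ConnectionManager {
    // (incarnation id, current channel).  The id only ever moves forward.
    channel: Mutex<(usize, Channel)>,
}

// Handed out per request and then discarded; it never switches channels.
struct Connection {
    channel_id: usize,
    channel: Channel,
    parent: Arc<ConnectionManager>,
}

impl Connection {
    // Called when the request using this connection hits a transport error.
    fn on_error(&self) {
        let mut channel_lock = self.parent.channel.lock();
        // Someone else already replaced the channel for this failure; only the
        // first reporter reconnects, and later requests pick up the replacement
        // via get_connection().
        if channel_lock.0 != self.channel_id {
            return;
        }
        channel_lock.0 += 1;
        channel_lock.1 = Channel; // Real code would establish a new channel here.
    }
}

impl ConnectionManager {
    // Each request gets a fresh Connection bound to the current channel, so
    // in-flight streams on an old channel can drain after a GOAWAY while new
    // work goes to the replacement.
    fn get_connection(self: &Arc<Self>) -> Connection {
        let guard = self.channel.lock();
        Connection {
            channel_id: guard.0,
            channel: guard.1.clone(),
            parent: Arc::clone(self),
        }
    }
}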

@allada (Collaborator) left a comment

Reviewable status: 0 of 1 LGTMs obtained


nativelink-util/src/grpc_utils.rs line 92 at r2 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

No, a Connection should only be used once. I really wanted to make this easier by impl tower::Service<http::HttpBody> for Connection but that would involve adding a dependency to tower.

Essentially, we create a new Channel in the parent ConnectionManager on failure, but any existing Connection that's being used can complete (and perhaps error) by itself. This is done because in the case of GoAway the other established streams on the Channel are able to complete, but we need to terminate after that graceful disconnect has finished.

Can we add some of this to a comment? This seems like something that we will foot-gun ourselves on in the future if we don't document it.

@chrisstaite-menlo (Collaborator, Author) left a comment

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, Vercel, asan / ubuntu-22.04, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04


nativelink-util/src/grpc_utils.rs line 92 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Can we add some of this to a comment? This seems like something that we will foot-gun ourselves on in the future if we don't document it.

Done.

@allada (Collaborator) left a comment

:lgtm:

Reviewed 1 of 7 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, publish-image, ubuntu-20.04 / stable, ubuntu-22.04 / stable, windows-2022 / stable

@chrisstaite-menlo merged commit 7548d4b into TraceMachina:main on Feb 5, 2024 (25 checks passed)
@adam-singer (Contributor) left a comment

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained
