Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support cluster mode when using Redis as a store #998

Merged
merged 1 commit into from
Jun 19, 2024

Conversation

caass
Copy link
Member

@caass caass commented Jun 13, 2024

Description

This PR brings support for using Redis in cluster mode as a store.

Fixes #869

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Notably, no new tests have been added. Since we currently use a mock to test our redis support, we don't have infrastructure in place to test against a live redis cluster.

I'd like to add some tests that spin up a redis cluster and assert that everything works ok in cluster mode, but it's a non-trivial amount of work. I'd like for them to run using the rust test harness, so likely that'd mean grouping them under a cargo feature (like, "slow-tests") and using testcontainers to create a redis cluster.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@caass caass force-pushed the caass/redis-cluster-support branch from 1e568aa to 89c97db Compare June 13, 2024 15:55
Copy link
Member Author

@caass caass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add tests yet because it felt like a lot of story points to build out infrastructure for running tests against actual redis instances. If necessary, I can block this PR on getting that built out (and ensuring tests pass with a single instance), and then unblock this once that infrastructure lands in main. I'm kind of scared because the reasoning about which commands are allowed in cluster mode vs. which aren't and whether or not you can use a pipeline or transaction aren't enforced at compile time, so I have a lot less confidence in "if it compiles, it works."

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 3 discussions need to be resolved


nativelink-store/src/redis_store.rs line 68 at r1 (raw file):

// I (@caass) tried a little to make this work but lost a couple days on it and decided to timebox.
#[derive(Clone)]
pub enum NativelinkRedisConnection {

Nit: This doesn't need to be pub, it's a holdover from when I was trying to test cluster support before I concluded it was a bigger job than I was comfortable doing in the same PR as this change.


nativelink-store/src/redis_store.rs line 68 at r1 (raw file):

// I (@caass) tried a little to make this work but lost a couple days on it and decided to timebox.
#[derive(Clone)]
pub enum NativelinkRedisConnection {

I spent a long time trying to avoid building this wrapper enum; I wanted to just use RedisStore<dyn ConnectionLike> instead but it kept getting confusing trying to pass around this trait object since we swap it into cells, move it to and from the stack, and all sorts of things during initialization. I was digging into ArcCell, it was a rabbit hole.


nativelink-store/src/redis_store.rs line 106 at r1 (raw file):

        }
    }
}

FYI there's crates like delegate that remove this boilerplate, but I don't think this is so horrible yet that it warrants bringing in a new dependency.

Code quote:

impl ConnectionLike for NativelinkRedisConnection {
    fn req_packed_command<'a>(
        &'a mut self,
        cmd: &'a redis::Cmd,
    ) -> redis::RedisFuture<'a, redis::Value> {
        match self {
            NativelinkRedisConnection::Single(inner) => inner.req_packed_command(cmd),
            NativelinkRedisConnection::Cluster(inner) => inner.req_packed_command(cmd),
        }
    }

    fn req_packed_commands<'a>(
        &'a mut self,
        cmd: &'a redis::Pipeline,
        offset: usize,
        count: usize,
    ) -> redis::RedisFuture<'a, Vec<redis::Value>> {
        match self {
            NativelinkRedisConnection::Single(inner) => {
                inner.req_packed_commands(cmd, offset, count)
            }
            NativelinkRedisConnection::Cluster(inner) => {
                inner.req_packed_commands(cmd, offset, count)
            }
        }
    }

    fn get_db(&self) -> i64 {
        match self {
            NativelinkRedisConnection::Single(inner) => inner.get_db(),
            NativelinkRedisConnection::Cluster(inner) => inner.get_db(),
        }
    }
}

nativelink-store/src/redis_store.rs line 110 at r1 (raw file):

impl NativelinkRedisConnection {
    /// Create a new connection to a single redis instance
    pub async fn single<T: redis::IntoConnectionInfo>(params: T) -> Result<Self, Error> {

Nit: Similar comment about the pub here.


nativelink-store/src/redis_store.rs line 121 at r1 (raw file):

    /// Create a connection to a redis cluster
    pub async fn cluster<T: redis::IntoConnectionInfo>(

Nit: and here.


nativelink-store/src/redis_store.rs line 248 at r1 (raw file):

                    async {
                        let mut conn = self.get_conn().await.err_tip(|| {

The call to get_conn is moved inside the async block here, rather than reused. But since each call eventually becomes a Arc::get, I don't think it's a huge deal.

Copy link
Member Author

@caass caass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@allada

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 3 discussions need to be resolved (waiting on @allada)

@caass caass force-pushed the caass/redis-cluster-support branch from 89c97db to 9e16976 Compare June 13, 2024 18:58
allada
allada previously requested changes Jun 13, 2024
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Cargo Dev / macos-13, Publish image, Publish nativelink-worker-lre-cc, and 7 discussions need to be resolved


-- commits line 7 at r1:
nit: Remove everything here and below, it's not relevant to this change. This would normally just be a comment in a ticket or in the review.


nativelink-store/src/redis_store.rs line 68 at r1 (raw file):

Previously, caass (Cass Fridkin) wrote…

Nit: This doesn't need to be pub, it's a holdover from when I was trying to test cluster support before I concluded it was a bigger job than I was comfortable doing in the same PR as this change.

Yeah, lets remove pub if possible.


nativelink-store/src/redis_store.rs line 68 at r1 (raw file):

Previously, caass (Cass Fridkin) wrote…

I spent a long time trying to avoid building this wrapper enum; I wanted to just use RedisStore<dyn ConnectionLike> instead but it kept getting confusing trying to pass around this trait object since we swap it into cells, move it to and from the stack, and all sorts of things during initialization. I was digging into ArcCell, it was a rabbit hole.

Yeah, i took a small stab at it, it's a pain to work around. Wrapping seems fine.


nativelink-store/src/redis_store.rs line 77 at r1 (raw file):

        &'a mut self,
        cmd: &'a redis::Cmd,
    ) -> redis::RedisFuture<'a, redis::Value> {

nit: '_ didnt work?


nativelink-store/src/redis_store.rs line 106 at r1 (raw file):

Previously, caass (Cass Fridkin) wrote…

FYI there's crates like delegate that remove this boilerplate, but I don't think this is so horrible yet that it warrants bringing in a new dependency.

This is a neat crate, we might use it in some other areas.


nativelink-store/src/redis_store.rs line 110 at r1 (raw file):

Previously, caass (Cass Fridkin) wrote…

Nit: Similar comment about the pub here.

nit: remove pub if we don't need it.


nativelink-store/src/redis_store.rs line 121 at r1 (raw file):

Previously, caass (Cass Fridkin) wrote…

Nit: and here.

ditto.


nativelink-store/src/redis_store.rs line 138 at r1 (raw file):

        let conn_fut = if config.addresses.len() == 1 {
            let addr = config.addresses[0].clone();
            NativelinkRedisConnection::single(addr).boxed().shared()

I don't believe this is what single/cluster means. There's an actual different mode redis will be in and I believe the clients must know if the server is a cluster server or single server.

We probably need to introduce a config that the user needs to specify if redis is a cluster or non-cluster, but please read/check this.


nativelink-store/src/redis_store.rs line 248 at r1 (raw file):

Previously, caass (Cass Fridkin) wrote…

The call to get_conn is moved inside the async block here, rather than reused. But since each call eventually becomes a Arc::get, I don't think it's a huge deal.

I would suggest doing it more similar to what it used to do. This reduces the need for a bunch of needless branches, since we will have a connection, we can just clone it before passing it into the future.


nativelink-store/src/redis_store.rs line 258 at r1 (raw file):

                    }
                })
                .collect::<FuturesOrdered<_>>();

Instead lets use .buffered()

And limit it to something like 50. This will help reduce an edge case where we have an insane number of futures outstanding all fighting each other.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r1.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Cargo Dev / macos-13, Publish image, Publish nativelink-worker-lre-cc, and 10 discussions need to be resolved (waiting on @allada)


nativelink-store/src/redis_store.rs line 61 at r1 (raw file):

/// and [cluster](https://redis.io/docs/latest/operate/oss_and_stack/reference/cluster-spec/) modes.)
//
// Theoretically, we shouldn't need this as [`RedisStore`] uses an [`ArcCell`] internally to hold its connection.

Are we trying to convey this is something we should change or a future TODO? If so lets make it a TODO(name):.

I think we are getting to a point where TODO should probably be ticketed and tracked every time they happen. We are not fully there yet.


nativelink-store/src/redis_store.rs line 66 at r1 (raw file):

// always inside a [`Result`] enum, whose arguments must be sized.
//
// I (@caass) tried a little to make this work but lost a couple days on it and decided to timebox.

This comment doesn't seem relevant on a technical level, lets remove.


nativelink-store/src/redis_store.rs line 197 at r1 (raw file):

    /// Encode a [`StoreKey`] so it can be sent to Redis.
    // This doesn't really need to return an `impl`, but I want to avoid relying specifically on

Lets drop the use of I in the code comments and use a TODO(name) if it intended to be refactored at a later time. Generally doc comments shouldn't use I as many people are working on the code base.

@caass caass force-pushed the caass/redis-cluster-support branch 2 times, most recently from 7f4d754 to 94c37b7 Compare June 17, 2024 17:20
Copy link
Member Author

@caass caass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, made some pretty sweeping changes here. Whoops...

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, windows-2022 / stable, and 6 discussions need to be resolved (waiting on @allada)


nativelink-store/src/redis_store.rs line 61 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Are we trying to convey this is something we should change or a future TODO? If so lets make it a TODO(name):.

I think we are getting to a point where TODO should probably be ticketed and tracked every time they happen. We are not fully there yet.

Marking this as resolved b/c the code changed pretty significantly.


nativelink-store/src/redis_store.rs line 66 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

This comment doesn't seem relevant on a technical level, lets remove.

Marking this as resolved b/c the code changed pretty significantly.


nativelink-store/src/redis_store.rs line 77 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: '_ didnt work?

No :(


nativelink-store/src/redis_store.rs line 138 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I don't believe this is what single/cluster means. There's an actual different mode redis will be in and I believe the clients must know if the server is a cluster server or single server.

We probably need to introduce a config that the user needs to specify if redis is a cluster or non-cluster, but please read/check this.

Yeah, you're right about this.


nativelink-store/src/redis_store.rs line 197 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Lets drop the use of I in the code comments and use a TODO(name) if it intended to be refactored at a later time. Generally doc comments shouldn't use I as many people are working on the code base.

Ok, will do.


nativelink-store/src/redis_store.rs line 258 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Instead lets use .buffered()

And limit it to something like 50. This will help reduce an edge case where we have an insane number of futures outstanding all fighting each other.

I'm running into a problem here trying to use futures::stream::iter where we actually define the StoreDriver trait such that the relationship between the lifetime of the borrowed slice of StoreKeys and the data inside those StoreKeys is undefined -- that is to say, according to the rust compiler, the data inside the StoreKeys could change before the reference is made invalid. A more correct definition would look like this:

#[async_trait]
pub trait StoreDriver: Sync + Send + Unpin + HealthStatusIndicator + 'static {
    /// See: [`StoreLike::has_with_results`] for details.
    async fn has_with_results<'a, 'b: 'a>(
        self: Pin<&Self>,
        digests: &'a [StoreKey<'b>],
        results: &mut [Option<usize>],
    ) -> Result<(), Error>;
}

At least, I think that's the problem...the actual error message I'm getting is:

error: implementation of `std::ops::FnOnce` is not general enough
   --> nativelink-store/src/redis_store.rs:345:28
    |
345 |       ) -> Result<(), Error> {
    |  ____________________________^
346 | |         if keys.len() == 1 && is_zero_digest(keys[0].borrow()) {
347 | |             results[0] = Some(0);
348 | |             return Ok(());
...   |
391 | |         Ok(())
392 | |     }
    | |_____^ implementation of `std::ops::FnOnce` is not general enough
    |
    = note: closure with signature `fn((usize, &nativelink_util::store_trait::StoreKey<'0>)) -> {async block@nativelink-store/src/redis_store.rs:359:13: 369:14}` must implement `std::ops::FnOnce<((usize, &nativelink_util::store_trait::StoreKey<'1>),)>`, for any two lifetimes `'0` and `'1`...
    = note: ...but it actually implements `std::ops::FnOnce<((usize, &nativelink_util::store_trait::StoreKey<'_>),)>`

which is kind of...nonsensical.

@caass caass force-pushed the caass/redis-cluster-support branch from 94c37b7 to 3dc3109 Compare June 17, 2024 19:41
@caass caass requested review from adam-singer and allada June 18, 2024 15:37
Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 4 discussions need to be resolved (waiting on @allada)


nativelink-config/src/stores.rs line 898 at r3 (raw file):

    /// Default: false,
    #[serde(default)]
    pub cluster: bool,

Would an enum type be a better option for different modes or connection types?

@caass caass force-pushed the caass/redis-cluster-support branch from 3dc3109 to 1f6dd50 Compare June 18, 2024 20:16
Copy link
Member Author

@caass caass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r4, all commit messages.
Dismissed @allada from a discussion.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Local / ubuntu-22.04, Vercel, pre-commit-checks, vale, and 1 discussions need to be resolved (waiting on @allada)


nativelink-config/src/stores.rs line 898 at r3 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Would an enum type be a better option for different modes or connection types?

Potentially, the only other option seems to be Sentinel So we can make that happen. Done.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, windows-2022 / stable, and 1 discussions need to be resolved (waiting on @allada)

Copy link
Contributor

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Cargo Dev / macos-13, Installation / macos-13, Installation / macos-14, Local / ubuntu-22.04, integration-tests (20.04), integration-tests (22.04), macos-13, and 1 discussions need to be resolved (waiting on @allada)

@caass caass force-pushed the caass/redis-cluster-support branch from 1f6dd50 to 035af96 Compare June 19, 2024 00:57
@caass caass enabled auto-merge (squash) June 19, 2024 01:30
This commit also refactors the background connection mechanism to be
generic over either singleton or cluster connections.
@caass caass force-pushed the caass/redis-cluster-support branch from 035af96 to b078048 Compare June 19, 2024 02:02
Copy link
Member Author

@caass caass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @allada from a discussion.
Reviewable status: 0 of 1 LGTMs obtained (waiting on @allada)


nativelink-store/src/redis_store.rs line 258 at r1 (raw file):

Previously, caass (Cass Fridkin) wrote…

I'm running into a problem here trying to use futures::stream::iter where we actually define the StoreDriver trait such that the relationship between the lifetime of the borrowed slice of StoreKeys and the data inside those StoreKeys is undefined -- that is to say, according to the rust compiler, the data inside the StoreKeys could change before the reference is made invalid. A more correct definition would look like this:

#[async_trait]
pub trait StoreDriver: Sync + Send + Unpin + HealthStatusIndicator + 'static {
    /// See: [`StoreLike::has_with_results`] for details.
    async fn has_with_results<'a, 'b: 'a>(
        self: Pin<&Self>,
        digests: &'a [StoreKey<'b>],
        results: &mut [Option<usize>],
    ) -> Result<(), Error>;
}

At least, I think that's the problem...the actual error message I'm getting is:

error: implementation of `std::ops::FnOnce` is not general enough
   --> nativelink-store/src/redis_store.rs:345:28
    |
345 |       ) -> Result<(), Error> {
    |  ____________________________^
346 | |         if keys.len() == 1 && is_zero_digest(keys[0].borrow()) {
347 | |             results[0] = Some(0);
348 | |             return Ok(());
...   |
391 | |         Ok(())
392 | |     }
    | |_____^ implementation of `std::ops::FnOnce` is not general enough
    |
    = note: closure with signature `fn((usize, &nativelink_util::store_trait::StoreKey<'0>)) -> {async block@nativelink-store/src/redis_store.rs:359:13: 369:14}` must implement `std::ops::FnOnce<((usize, &nativelink_util::store_trait::StoreKey<'1>),)>`, for any two lifetimes `'0` and `'1`...
    = note: ...but it actually implements `std::ops::FnOnce<((usize, &nativelink_util::store_trait::StoreKey<'_>),)>`

which is kind of...nonsensical.

I'm going to address this in a follow-up PR once the trait is modified.

Copy link
Member Author

@caass caass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-@allada will add a buffer to the redis connection in a follow-up PR

Reviewable status: :shipit: complete! 2 of 1 LGTMs obtained

@caass caass dismissed allada’s stale review June 19, 2024 11:16

Will add buffering in a follow-up PR

@caass caass merged commit c85b6df into TraceMachina:main Jun 19, 2024
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Redis Store Connection should support Cluster and Connection Manager
4 participants