-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support cluster mode when using Redis as a store #998
Conversation
1e568aa
to
89c97db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add tests yet because it felt like a lot of story points to build out infrastructure for running tests against actual redis instances. If necessary, I can block this PR on getting that built out (and ensuring tests pass with a single instance), and then unblock this once that infrastructure lands in main. I'm kind of scared because the reasoning about which commands are allowed in cluster mode vs. which aren't and whether or not you can use a pipeline or transaction aren't enforced at compile time, so I have a lot less confidence in "if it compiles, it works."
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 3 discussions need to be resolved
nativelink-store/src/redis_store.rs
line 68 at r1 (raw file):
// I (@caass) tried a little to make this work but lost a couple days on it and decided to timebox. #[derive(Clone)] pub enum NativelinkRedisConnection {
Nit: This doesn't need to be pub
, it's a holdover from when I was trying to test cluster support before I concluded it was a bigger job than I was comfortable doing in the same PR as this change.
nativelink-store/src/redis_store.rs
line 68 at r1 (raw file):
// I (@caass) tried a little to make this work but lost a couple days on it and decided to timebox. #[derive(Clone)] pub enum NativelinkRedisConnection {
I spent a long time trying to avoid building this wrapper enum; I wanted to just use RedisStore<dyn ConnectionLike>
instead but it kept getting confusing trying to pass around this trait object since we swap it into cells, move it to and from the stack, and all sorts of things during initialization. I was digging into ArcCell
, it was a rabbit hole.
nativelink-store/src/redis_store.rs
line 106 at r1 (raw file):
} } }
FYI there's crates like delegate that remove this boilerplate, but I don't think this is so horrible yet that it warrants bringing in a new dependency.
Code quote:
impl ConnectionLike for NativelinkRedisConnection {
fn req_packed_command<'a>(
&'a mut self,
cmd: &'a redis::Cmd,
) -> redis::RedisFuture<'a, redis::Value> {
match self {
NativelinkRedisConnection::Single(inner) => inner.req_packed_command(cmd),
NativelinkRedisConnection::Cluster(inner) => inner.req_packed_command(cmd),
}
}
fn req_packed_commands<'a>(
&'a mut self,
cmd: &'a redis::Pipeline,
offset: usize,
count: usize,
) -> redis::RedisFuture<'a, Vec<redis::Value>> {
match self {
NativelinkRedisConnection::Single(inner) => {
inner.req_packed_commands(cmd, offset, count)
}
NativelinkRedisConnection::Cluster(inner) => {
inner.req_packed_commands(cmd, offset, count)
}
}
}
fn get_db(&self) -> i64 {
match self {
NativelinkRedisConnection::Single(inner) => inner.get_db(),
NativelinkRedisConnection::Cluster(inner) => inner.get_db(),
}
}
}
nativelink-store/src/redis_store.rs
line 110 at r1 (raw file):
impl NativelinkRedisConnection { /// Create a new connection to a single redis instance pub async fn single<T: redis::IntoConnectionInfo>(params: T) -> Result<Self, Error> {
Nit: Similar comment about the pub
here.
nativelink-store/src/redis_store.rs
line 121 at r1 (raw file):
/// Create a connection to a redis cluster pub async fn cluster<T: redis::IntoConnectionInfo>(
Nit: and here.
nativelink-store/src/redis_store.rs
line 248 at r1 (raw file):
async { let mut conn = self.get_conn().await.err_tip(|| {
The call to get_conn
is moved inside the async block here, rather than reused. But since each call eventually becomes a Arc::get
, I don't think it's a huge deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 3 discussions need to be resolved (waiting on @allada)
89c97db
to
9e16976
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Cargo Dev / macos-13, Publish image, Publish nativelink-worker-lre-cc, and 7 discussions need to be resolved
-- commits
line 7 at r1:
nit: Remove everything here and below, it's not relevant to this change. This would normally just be a comment in a ticket or in the review.
nativelink-store/src/redis_store.rs
line 68 at r1 (raw file):
Previously, caass (Cass Fridkin) wrote…
Nit: This doesn't need to be
pub
, it's a holdover from when I was trying to test cluster support before I concluded it was a bigger job than I was comfortable doing in the same PR as this change.
Yeah, lets remove pub if possible.
nativelink-store/src/redis_store.rs
line 68 at r1 (raw file):
Previously, caass (Cass Fridkin) wrote…
I spent a long time trying to avoid building this wrapper enum; I wanted to just use
RedisStore<dyn ConnectionLike>
instead but it kept getting confusing trying to pass around this trait object since we swap it into cells, move it to and from the stack, and all sorts of things during initialization. I was digging intoArcCell
, it was a rabbit hole.
Yeah, i took a small stab at it, it's a pain to work around. Wrapping seems fine.
nativelink-store/src/redis_store.rs
line 77 at r1 (raw file):
&'a mut self, cmd: &'a redis::Cmd, ) -> redis::RedisFuture<'a, redis::Value> {
nit: '_
didnt work?
nativelink-store/src/redis_store.rs
line 106 at r1 (raw file):
Previously, caass (Cass Fridkin) wrote…
FYI there's crates like delegate that remove this boilerplate, but I don't think this is so horrible yet that it warrants bringing in a new dependency.
This is a neat crate, we might use it in some other areas.
nativelink-store/src/redis_store.rs
line 110 at r1 (raw file):
Previously, caass (Cass Fridkin) wrote…
Nit: Similar comment about the
pub
here.
nit: remove pub
if we don't need it.
nativelink-store/src/redis_store.rs
line 121 at r1 (raw file):
Previously, caass (Cass Fridkin) wrote…
Nit: and here.
ditto.
nativelink-store/src/redis_store.rs
line 138 at r1 (raw file):
let conn_fut = if config.addresses.len() == 1 { let addr = config.addresses[0].clone(); NativelinkRedisConnection::single(addr).boxed().shared()
I don't believe this is what single/cluster means. There's an actual different mode redis will be in and I believe the clients must know if the server is a cluster server or single server.
We probably need to introduce a config that the user needs to specify if redis is a cluster or non-cluster, but please read/check this.
nativelink-store/src/redis_store.rs
line 248 at r1 (raw file):
Previously, caass (Cass Fridkin) wrote…
The call to
get_conn
is moved inside the async block here, rather than reused. But since each call eventually becomes aArc::get
, I don't think it's a huge deal.
I would suggest doing it more similar to what it used to do. This reduces the need for a bunch of needless branches, since we will have a connection, we can just clone it before passing it into the future.
nativelink-store/src/redis_store.rs
line 258 at r1 (raw file):
} }) .collect::<FuturesOrdered<_>>();
Instead lets use .buffered()
And limit it to something like 50. This will help reduce an edge case where we have an insane number of futures outstanding all fighting each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 5 files at r1.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Cargo Dev / macos-13, Publish image, Publish nativelink-worker-lre-cc, and 10 discussions need to be resolved (waiting on @allada)
nativelink-store/src/redis_store.rs
line 61 at r1 (raw file):
/// and [cluster](https://redis.io/docs/latest/operate/oss_and_stack/reference/cluster-spec/) modes.) // // Theoretically, we shouldn't need this as [`RedisStore`] uses an [`ArcCell`] internally to hold its connection.
Are we trying to convey this is something we should change or a future TODO
? If so lets make it a TODO(name):
.
I think we are getting to a point where TODO
should probably be ticketed and tracked every time they happen. We are not fully there yet.
nativelink-store/src/redis_store.rs
line 66 at r1 (raw file):
// always inside a [`Result`] enum, whose arguments must be sized. // // I (@caass) tried a little to make this work but lost a couple days on it and decided to timebox.
This comment doesn't seem relevant on a technical level, lets remove.
nativelink-store/src/redis_store.rs
line 197 at r1 (raw file):
/// Encode a [`StoreKey`] so it can be sent to Redis. // This doesn't really need to return an `impl`, but I want to avoid relying specifically on
Lets drop the use of I
in the code comments and use a TODO(name)
if it intended to be refactored at a later time. Generally doc comments shouldn't use I
as many people are working on the code base.
7f4d754
to
94c37b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, made some pretty sweeping changes here. Whoops...
Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, windows-2022 / stable, and 6 discussions need to be resolved (waiting on @allada)
nativelink-store/src/redis_store.rs
line 61 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
Are we trying to convey this is something we should change or a future
TODO
? If so lets make it aTODO(name):
.I think we are getting to a point where
TODO
should probably be ticketed and tracked every time they happen. We are not fully there yet.
Marking this as resolved b/c the code changed pretty significantly.
nativelink-store/src/redis_store.rs
line 66 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
This comment doesn't seem relevant on a technical level, lets remove.
Marking this as resolved b/c the code changed pretty significantly.
nativelink-store/src/redis_store.rs
line 77 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit:
'_
didnt work?
No :(
nativelink-store/src/redis_store.rs
line 138 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I don't believe this is what single/cluster means. There's an actual different mode redis will be in and I believe the clients must know if the server is a cluster server or single server.
We probably need to introduce a config that the user needs to specify if redis is a cluster or non-cluster, but please read/check this.
Yeah, you're right about this.
nativelink-store/src/redis_store.rs
line 197 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
Lets drop the use of
I
in the code comments and use aTODO(name)
if it intended to be refactored at a later time. Generally doc comments shouldn't useI
as many people are working on the code base.
Ok, will do.
nativelink-store/src/redis_store.rs
line 258 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Instead lets use .buffered()
And limit it to something like 50. This will help reduce an edge case where we have an insane number of futures outstanding all fighting each other.
I'm running into a problem here trying to use futures::stream::iter
where we actually define the StoreDriver
trait such that the relationship between the lifetime of the borrowed slice of StoreKey
s and the data inside those StoreKey
s is undefined -- that is to say, according to the rust compiler, the data inside the StoreKey
s could change before the reference is made invalid. A more correct definition would look like this:
#[async_trait]
pub trait StoreDriver: Sync + Send + Unpin + HealthStatusIndicator + 'static {
/// See: [`StoreLike::has_with_results`] for details.
async fn has_with_results<'a, 'b: 'a>(
self: Pin<&Self>,
digests: &'a [StoreKey<'b>],
results: &mut [Option<usize>],
) -> Result<(), Error>;
}
At least, I think that's the problem...the actual error message I'm getting is:
error: implementation of `std::ops::FnOnce` is not general enough
--> nativelink-store/src/redis_store.rs:345:28
|
345 | ) -> Result<(), Error> {
| ____________________________^
346 | | if keys.len() == 1 && is_zero_digest(keys[0].borrow()) {
347 | | results[0] = Some(0);
348 | | return Ok(());
... |
391 | | Ok(())
392 | | }
| |_____^ implementation of `std::ops::FnOnce` is not general enough
|
= note: closure with signature `fn((usize, &nativelink_util::store_trait::StoreKey<'0>)) -> {async block@nativelink-store/src/redis_store.rs:359:13: 369:14}` must implement `std::ops::FnOnce<((usize, &nativelink_util::store_trait::StoreKey<'1>),)>`, for any two lifetimes `'0` and `'1`...
= note: ...but it actually implements `std::ops::FnOnce<((usize, &nativelink_util::store_trait::StoreKey<'_>),)>`
which is kind of...nonsensical.
94c37b7
to
3dc3109
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 4 discussions need to be resolved (waiting on @allada)
nativelink-config/src/stores.rs
line 898 at r3 (raw file):
/// Default: false, #[serde(default)] pub cluster: bool,
Would an enum
type be a better option for different modes or connection types?
3dc3109
to
1f6dd50
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r4, all commit messages.
Dismissed @allada from a discussion.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Local / ubuntu-22.04, Vercel, pre-commit-checks, vale, and 1 discussions need to be resolved (waiting on @allada)
nativelink-config/src/stores.rs
line 898 at r3 (raw file):
Previously, adam-singer (Adam Singer) wrote…
Would an
enum
type be a better option for different modes or connection types?
Potentially, the only other option seems to be Sentinel So we can make that happen. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, windows-2022 / stable, and 1 discussions need to be resolved (waiting on @allada)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Cargo Dev / macos-13, Installation / macos-13, Installation / macos-14, Local / ubuntu-22.04, integration-tests (20.04), integration-tests (22.04), macos-13, and 1 discussions need to be resolved (waiting on @allada)
1f6dd50
to
035af96
Compare
This commit also refactors the background connection mechanism to be generic over either singleton or cluster connections.
035af96
to
b078048
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @allada from a discussion.
Reviewable status: 0 of 1 LGTMs obtained (waiting on @allada)
nativelink-store/src/redis_store.rs
line 258 at r1 (raw file):
Previously, caass (Cass Fridkin) wrote…
I'm running into a problem here trying to use
futures::stream::iter
where we actually define theStoreDriver
trait such that the relationship between the lifetime of the borrowed slice ofStoreKey
s and the data inside thoseStoreKey
s is undefined -- that is to say, according to the rust compiler, the data inside theStoreKey
s could change before the reference is made invalid. A more correct definition would look like this:#[async_trait] pub trait StoreDriver: Sync + Send + Unpin + HealthStatusIndicator + 'static { /// See: [`StoreLike::has_with_results`] for details. async fn has_with_results<'a, 'b: 'a>( self: Pin<&Self>, digests: &'a [StoreKey<'b>], results: &mut [Option<usize>], ) -> Result<(), Error>; }At least, I think that's the problem...the actual error message I'm getting is:
error: implementation of `std::ops::FnOnce` is not general enough --> nativelink-store/src/redis_store.rs:345:28 | 345 | ) -> Result<(), Error> { | ____________________________^ 346 | | if keys.len() == 1 && is_zero_digest(keys[0].borrow()) { 347 | | results[0] = Some(0); 348 | | return Ok(()); ... | 391 | | Ok(()) 392 | | } | |_____^ implementation of `std::ops::FnOnce` is not general enough | = note: closure with signature `fn((usize, &nativelink_util::store_trait::StoreKey<'0>)) -> {async block@nativelink-store/src/redis_store.rs:359:13: 369:14}` must implement `std::ops::FnOnce<((usize, &nativelink_util::store_trait::StoreKey<'1>),)>`, for any two lifetimes `'0` and `'1`... = note: ...but it actually implements `std::ops::FnOnce<((usize, &nativelink_util::store_trait::StoreKey<'_>),)>`
which is kind of...nonsensical.
I'm going to address this in a follow-up PR once the trait is modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-@allada will add a buffer to the redis connection in a follow-up PR
Reviewable status: complete! 2 of 1 LGTMs obtained
Description
This PR brings support for using Redis in cluster mode as a store.
Fixes #869
Type of change
Please delete options that aren't relevant.
not work as expected)
How Has This Been Tested?
Notably, no new tests have been added. Since we currently use a mock to test our redis support, we don't have infrastructure in place to test against a live redis cluster.
I'd like to add some tests that spin up a redis cluster and assert that everything works ok in cluster mode, but it's a non-trivial amount of work. I'd like for them to run using the rust test harness, so likely that'd mean grouping them under a cargo feature (like, "slow-tests") and using testcontainers to create a redis cluster.
Checklist
bazel test //...
passes locallygit amend
see some docsThis change is