Skip to content

Conversation

@amankrx
Copy link
Collaborator

@amankrx amankrx commented Mar 26, 2025

Description

This PR implements the Google Cloud Store implementation and closely aligns with the AWS S3 implementation.
This is an update to my previous PR I raised using the gRPC implementation. There were multiple issues with that PR, and the primary issue was the stability. This PR fixes all the underlying issues we were facing with that PR. We are using the google-cloud-storage and google-cloud-auth crates.

Note: To run the gcs_backend.json5 example that I have provided, you should ensure that the path to the credentials file is specified in the GOOGLE_APPLICATION_CREDENTIALS environment variable. Check https://docs.rs/google-cloud-storage/0.24.0/google_cloud_storage/#automatically for more details.

Fixes #659
/claim #659

Type of change

Please delete options that aren't relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

I did an E2E testing using the gcs_backend.json5 example that I have provided. Along with that, I have set up a mocking framework for the GCS client and added a few tests for both the GCS store and the GCS client.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@MarcusSorealheis
Copy link
Collaborator

The only issue now is your private key.

@amankrx
Copy link
Collaborator Author

amankrx commented Mar 26, 2025

The only issue now is your private key.

Fixed it!

Copy link
Contributor

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get through all of it, but at the moment I'd say that this would introduce too much maintenance overhead considering that this is all a single store. If this is the amount of logic necessary to support a reliable GCS setup I'd honestly lean towards dropping this entirely.

cc @kubevalet AFAIU we already run on gcs via the filesystem store, right? Would it be viable to just recommend people to use that instead and improve documentation around that? I'm trying to understand the tradeoff that we'd be making here in terms of maintainability vs maintenance overhead. Are there usecases where one "has" to use GCS?

@amankrx Maybe my concerns are unreasonable. Have you explored ways to make this... well let's say "less code"?

Reviewed 19 of 19 files at r1, all commit messages.
Reviewable status: 0 of 2 LGTMs obtained, and all files reviewed, and 12 discussions need to be resolved


-- commits line 2 at r1:
"Add" instead of "Added"


nativelink-store/Cargo.toml line 31 at r1 (raw file):

bytes = { version = "1.10.0", default-features = false }
bytes-utils = { version = "0.1.4", default-features = false }
chrono = "0.4.39"

Add this (and the other creates) with x = { version = "y", default -features = false } by default to prevent pulling in unnecessary deps.


nativelink-store/src/gcs_client/client.rs line 84 at r1 (raw file):

    pub async fn new_with_http_client(
        spec: &GcsSpec,
        client: hyper::Client<C>,

fyi this will need a hyper 1.x implementation after #1639. It might be easier to build on top of that so that we don't need migrate this one as well. The hyper 1.x migration is a bit complicated and we're still testing, but it should make it into main in the next few days.


nativelink-store/src/gcs_client/client.rs line 229 at r1 (raw file):

            return Err(make_err!(
                match status.as_u16() {
                    401 | 403 => Code::PermissionDenied,

nit: Do these have typed names?


nativelink-store/src/gcs_client/client.rs line 315 at r1 (raw file):

        }

        tokio::time::sleep(Duration::from_millis(200)).await; // Small delay to allow GCS to finalize

Let's make this a named const.


nativelink-store/src/gcs_client/client.rs line 642 at r1 (raw file):

        match status.as_u16() {
            200 | 201 | 308 => Ok(()),

nit: as above


nativelink-store/src/gcs_client/operations.rs line 14 at r1 (raw file):

// See the License for the specific language governing permissions and
// limitations under the License.

nit: Seems unnecessary to have this in a dedicated file.


Cargo.lock line 42 at r1 (raw file):

[[package]]
name = "android_system_properties"

Whatever causes these android_* dependencies to be pulled in should probably be disabled. My guess is that it's some default feature in some cratej (chrono maybe?). Try adding the deps as somedep = { version = "x.y.z", default-features = false, features = ["somefeature"] } in Cargo.toml where you only enable the minimal set of features that we need.


nativelink-store/src/gcs_store.rs line 42 at r1 (raw file):

};

struct ConnectionPool {

fyi We'll probably need to rework this as we're moving the entire nativelink-store setup to hyper 1.x (Current WIP branch at #1639). I general I like the explicitness of this approach though and I'll check whether this is something I should add to the aws store as well.


nativelink-store/src/gcs_store.rs line 204 at r1 (raw file):

    NowFn: Fn() -> I + Send + Sync + Unpin + 'static,
{
    async fn has_with_results(

This looks duplicated. We might get away with lifting this out of the aws implementation and reusing it in both places.


nativelink-store/src/gcs_store.rs line 272 at r1 (raw file):

        // For larger files, use resumable upload
        // First, initiate the upload session
        let upload_id = self

note to self: This looks wrong to me. I can't pinpoint it though. Will to evaluate in more detail.


nativelink-store/src/gcs_store.rs line 275 at r1 (raw file):

            .retrier
            .retry(unfold(object.clone(), move |object| async move {
                let conn = match self.connection_pool.acquire().await {

Hmm I'm not entirely sure on this one but I feel like the connection acquisition and retrial of that should be handled via the client side. Otherwise we'd have to "remember" to always acquire the permit in the calls. And it introduces fairly tight coupling between the store implementation and the underlying client.


nativelink-store/src/gcs_store.rs line 436 at r1 (raw file):

            .retrier
            .retry(unfold(
                (object.clone(), offset, end_offset),

I believe all of these xx.clone() calls in the unfolds (here and everywhere else) should be moves. It shouldn't be necessary to clone the objects or at least only do so in the minimal cases necessary.


nativelink-store/src/gcs_client/connector.rs line 83 at r1 (raw file):

        Pin::new(&mut Pin::get_mut(self).connection).poll_shutdown(cx)
    }
}

fyi this will need to be changed after the hyper 1.x migration mentioned earlier.


.gitignore line 6 at r1 (raw file):

.idea/
.zed
.idea/

This is duplicated from two lines above. Should be safe to remove.


nativelink-store/src/gcs_client/auth.rs line 35 at r1 (raw file):

#[derive(Debug, Serialize)]
struct JwtClaims {

Is it actually necessary to roll our own implementation here? Are there reasons we can't use e.g. https://github.com/googleapis/google-cloud-rust/tree/main/src/auth or https://github.com/yoshidan/google-cloud-rust/tree/main/foundation/auth ?

For instance, if we used the official sdk, if we wanted to move the gcs store to non-experimental we could just wait until the official rust sdk reaches GA and then "bump a dependency".

I'm also worried that rolling our own here would require us to maintain perfect coverage and fuzzing as it would be "on us" to maintain the security guarantees of this implementation. This seems like a big lift. cc @kubevalet

Copy link
Collaborator Author

@amankrx amankrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get this done with less code altogether. The primary issue is that we don't have a proper GCS SDK with a custom choice of client to use. So, the majority of the implementation is primarily the client itself. I'm not sure if we would be changing the client that much later on. So, it's up to us if we want a custom client like the one that the one I have provided, or to use third-party ones like https://github.com/abdolence/gcloud-sdk-rs
Also, it was initially less than 1000 lines of code PR, but the primary overhead was implementing traits and associated methods to add the unit tests. Apart from that, if it is absolutely necessary, I can work on reducing the unnecessary addition of extra modularity and concise it a bit.

Reviewable status: 0 of 2 LGTMs obtained, and all files reviewed, and 12 discussions need to be resolved


-- commits line 2 at r1:

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

"Add" instead of "Added"

Sure. I will fix the commit message before the final merge.


.gitignore line 6 at r1 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

This is duplicated from two lines above. Should be safe to remove.

Ohh... This was added a while ago when I had raised the PR with the gRPC implementation. We didn't have .idea added at that time. I will remove it now.


Cargo.lock line 42 at r1 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

Whatever causes these android_* dependencies to be pulled in should probably be disabled. My guess is that it's some default feature in some cratej (chrono maybe?). Try adding the deps as somedep = { version = "x.y.z", default-features = false, features = ["somefeature"] } in Cargo.toml where you only enable the minimal set of features that we need.

Yeah, looks like chrono added a number of crates unnecessarily. I will disable most of the features.


nativelink-store/src/gcs_store.rs line 204 at r1 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

This looks duplicated. We might get away with lifting this out of the aws implementation and reusing it in both places.

Yup, this is the same implementation we have for AWS. The primary implementation is what we have for the has method above. I would also argue that we can reuse the update and get_part methods, and the primary implementation can be shifted to a similar method we have as the has method.


nativelink-store/src/gcs_client/auth.rs line 35 at r1 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

Is it actually necessary to roll our own implementation here? Are there reasons we can't use e.g. https://github.com/googleapis/google-cloud-rust/tree/main/src/auth or https://github.com/yoshidan/google-cloud-rust/tree/main/foundation/auth ?

For instance, if we used the official sdk, if we wanted to move the gcs store to non-experimental we could just wait until the official rust sdk reaches GA and then "bump a dependency".

I'm also worried that rolling our own here would require us to maintain perfect coverage and fuzzing as it would be "on us" to maintain the security guarantees of this implementation. This seems like a big lift. cc @kubevalet

As I mentioned above, we can definitely use these crates, but they tend to pull in the reqwest library and we can't turn off that feature right away. If we decide to use that library, we wouldn't have to worry about the GCS client implementation though.


nativelink-store/src/gcs_client/client.rs line 315 at r1 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

Let's make this a named const.

Sure!


nativelink-store/src/gcs_client/operations.rs line 14 at r1 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

nit: Seems unnecessary to have this in a dedicated file.

Can be moved to client.rs.

@aaronmondal
Copy link
Contributor

@amankrx Ah I think I get now where the aversion to reqwest is coming from 😅 You're probably referring to this: #1547 (review). What I meant by that is that we shouldn't directly depend on reqwest as we have a bunch of other mechanisms to make requests. If the google-cloud-sdk pulls that in it'd be fine as it would be similar to the aws sdk where we simply cannot control what official packages require. It's also possible to minimize the transitive dependencies by e.g. overriding the default features (although I wouldn't even consider it that problematic as long as we don't pull in any complex C/C++ dependencies).

So if we can get auth with the official sdk, even if that pulls in reqwest transitively it'd be totally fine as it's not us directly depending on reqwest.

This should significantly simplify this setup (and alleviate my security concerns as we can essentially treat it as "not our problem" if the official sdk has issues 😆 This should help us "remove" the current auth.rs file.

If the new client implementation in #1639 is generic enough we should be able to remove the connector.rs from this PR as well (might need some adjustments and we don't actually know whether it works yet, but let's hope it does 🤞 - actually, did we ever try to point the aws implementation to GCS? Since we're doing REST now anyways, couldn't we potentially use almost 1:1 the aws store?).

Regarding the current config.rs the existing aws implementation will be changed/extended to support more methods anyways because of some requirements from #1630. This migtht allow us to consolidate this to some degree with something like a more generic "cloudSpec" or "authSpec" or something along these lines that we can reuse across the aws and gcp implementations.

There doesn't seem to be any easy way around the mock.rs implementation, but since that's test code I'd say it's fine.

With changes along these lines I'm optimistic that we can significantly reduce the maintenance overhead of the GCSStore.

@MarcusSorealheis
Copy link
Collaborator

@amankrx the commit messages are identical @amankrx. Also, as a heads up you will need to squash the commits.

@amankrx
Copy link
Collaborator Author

amankrx commented Apr 20, 2025

@amankrx the commit messages are identical @amankrx. Also, as a heads up you will need to squash the commits.

Yup! I have added an extra commit now, such that it would be easier if we do revert the changes. The current implementation is based on the google_cloud_store and google_cloud_auth crates. The previous commit is the manual implementation without using these crates.
Before merging the PR, I will squash the commits into a single commit.

Also, there might be some CI failures. I will address some of the @aaronmondal 's comments as well by today. I just wanted to push these changes to get an idea if the current implementation seems to be in the right direction. I tried using AWS's SDK for GCS auth and stuff, but there were always some challenge with the trait bounds, or some other issue. If we really need that implementation, then I can dive a bit more deeper into it, but until then, please let me know if this implementation seems good enough?
Also, regarding the large lines of code, I would say most of these are tests and mock implementations, the actual client and store implementation doesn't seem to be that large.

Copy link
Contributor

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @amankrx and thank you for being so patient with all the reviews! This was certainly a pretty difficult one with a bunch of back and forth and I feel like we now have a strong baseline for future stores as well ❤️

:lgtm:

Reviewed 2 of 22 files at r5, 9 of 12 files at r10, 2 of 2 files at r11, all commit messages.
Reviewable status: 2 of 2 LGTMs obtained, and all files reviewed, and 1 discussions need to be resolved

@aaronmondal
Copy link
Contributor

@amankrx (just needs a rebase after f6269d1

@amankrx
Copy link
Collaborator Author

amankrx commented May 1, 2025

Done! Thanks everyone for such a detailed and awesome review process!

@amankrx amankrx force-pushed the rest-google-cloud-store branch from b45617f to d5a0a08 Compare May 1, 2025 19:20
@aaronmondal aaronmondal enabled auto-merge (squash) May 1, 2025 19:47
Copy link
Contributor

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 4 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: :shipit: complete! 2 of 2 LGTMs obtained, and all files reviewed

@aaronmondal aaronmondal merged commit 2839470 into TraceMachina:main May 1, 2025
37 checks passed
@amankrx amankrx deleted the rest-google-cloud-store branch May 1, 2025 19:59
MarcusSorealheis pushed a commit to MarcusSorealheis/nativelink that referenced this pull request Nov 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

💎 Implement Google Cloud Bucket Store

5 participants