HeadObjectError in S3: no providers in chain provided credentials #428

Closed
mhz5 opened this issue Nov 28, 2023 · 13 comments
@mhz5
Contributor

mhz5 commented Nov 28, 2023

To prevent regressions in the AWS deployment, it might be worth having CI perform a dev native-link deployment in AWS and point some builds at it (nightly?).

Deployed native-link on AWS (instructions) at 4cc53bc and attempted to point a build at the deployment. Encountered the following error:

bazel test //:dummy_test \
    --remote_cache=grpcs://cas.DOMAIN \
    --remote_executor=grpcs://scheduler.DOMAIN \
    --remote_instance_name=main

INFO: Invocation ID: d4859976-5154-434c-91f9-3537d8ef7d40
INFO: Analyzed target //:dummy_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
ERROR: /home/ubuntu/native-link/BUILD.bazel:42:8: Executing genrule //:dummy_test_sh failed: (Exit 34): UNAVAILABLE: Unhandled HeadObjectError in S3: Unhandled { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: CredentialsNotLoaded(CredentialsNotLoaded { source: "no providers in chain provided credentials" }), connection: Unknown } }), meta: ErrorMetadata { code: None, message: None, extras: None } }, retries: 7 : Failed to run has() on slow store : Inner store get in compression store failed : Compression underlying store get failed : --- : Received erroneous partial chunk: Error { code: Internal, messages: ["Writer was dropped before EOF was sent"] } : During first buf_channel::take() : Failed to read header in get_part compression store : Failed to get_part in get_part_unchunked : --- : Received erroneous partial chunk: Error { code: Internal, messages: ["Writer was dropped before EOF was sent"] } : Failed to recv first chunk in collect_all_with_size_hint : Failed to read stream to completion in get_part_unchunked
INFO: Elapsed time: 4.626s, Critical Path: 3.88s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
//:dummy_test FAILED TO BUILD

Executed 0 out of 1 test: 1 fails to build.

@mhz5
Contributor Author

mhz5 commented Nov 28, 2023

@allada @aaronmondal
If you have a minute, any pointers on diagnosing this?
If not I'll just poke around on my own time.

@mhz5
Contributor Author

mhz5 commented Nov 28, 2023

Succeeds at 56eda36

bazel test //:dummy_test     --remote_cache=grpcs://cas.DOMAIN     --remote_executor=grpcs://scheduler.DOMAIN     --remote_instance_name=main

@aaronmondal aaronmondal self-assigned this Nov 29, 2023
@aaronmondal
Contributor

@mhz5 Is this still an issue after #423? @allada added some additional config options in #421 for the endpoint URL, though I'm not sure whether that's actually related. It seems like it can't find the AWS credentials.

Note to self: It seems like we can now also add support for SSO: awslabs/aws-sdk-rust#703.

@allada
Collaborator

allada commented Nov 29, 2023

Yeah, can you tell us what checkout you are using? It may have been fixed in recent changes.

We have not had time to deep-dive the AWS config again to verify everything, because we are in the process of releasing a GCP variation and a pinned version release.

We will be making more changes to AWS config in the coming months to support rule-based auto-scaling policies, so be on the lookout for those too.

@mhz5
Contributor Author

mhz5 commented Nov 29, 2023

@mhz5 Is this still an issue after #423 ?

Yes this occurs even at #423.
I updated the description to include the deployed commit.
Thanks for taking a look guys, I will dig deeper.
Looking forward to the pinned release!

@mhz5
Contributor Author

mhz5 commented Nov 30, 2023

This regression occurred with this PR:
#369

This code section

@aaronmondal
Contributor

Hmm I just played around with things and can't reproduce this issue. I can imagine two things going wrong here:

The new S3 implementation enforces HTTPS by default and now requires explicitly setting the insecure_allow_http flag for HTTP traffic. If the flag is not set, the error message looks something like this:

[2023-12-01T11:53:19.521Z WARN  aws_smithy_runtime::client::http::hyper_014] unrecognized error from Hyper. If this error should be retried, please file an issue. err=error trying to connect: Error { code: Unavailable, messages: ["Failed to call S3 connector: Custom { kind: Other, error: \"Unsupported scheme http\" }, retries: 7"] }: Error { code: Unavailable, messages: ["Failed to call S3 connector: Custom { kind: Other, error: \"Unsupported scheme http\" }, retries: 7"] } (hyper::Error(Connect, Error { code: Unavailable, messages: ["Failed to call S3 connector: Custom { kind: Other, error: \"Unsupported scheme http\" }, retries: 7"] }))

In this case the fix would be to set insecure_allow_http to true in the config.json.

Otherwise it might just be a difference in the way you pass the credentials to the store. The AWS SDK uses a different config detection mechanism than the previous implementation. In my (working) case, I have credentials at ~/.aws/config and ~/.aws/credentials.

I assumed that https://github.com/TraceMachina/native-link/blob/3ec203b9c17e8e4dfa7160f74e948c64e542de16/native-link-store/src/s3_store.rs#L152C11-L152C11 correctly forwards environment variables to the S3 store if they're set. If this is not the case we might need to make the detection of the credential environment variables explicit.
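A quick way to check which of the static credential sources mentioned above are visible on a given machine is sketched below. This is only a diagnostic sketch, not NativeLink or SDK code: the function name `visible_credential_sources` is hypothetical, and it only looks at the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` variables and the shared files (honoring `AWS_SHARED_CREDENTIALS_FILE` and `AWS_CONFIG_FILE` if set). It does not validate the credentials themselves.

```python
import os
from pathlib import Path


def visible_credential_sources(env=None, home=None):
    """Return a list of static AWS credential sources that appear to be present.

    `env` and `home` are injectable for testing; by default the real process
    environment and the current user's home directory are used.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    found = []

    # Source 1: static keys passed through environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")

    # Source 2: shared credentials file (default ~/.aws/credentials).
    creds_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE") or home / ".aws" / "credentials")
    if creds_file.is_file():
        found.append(f"shared credentials file ({creds_file})")

    # Source 3: shared config file (default ~/.aws/config).
    config_file = Path(env.get("AWS_CONFIG_FILE") or home / ".aws" / "config")
    if config_file.is_file():
        found.append(f"shared config file ({config_file})")

    return found
```

Calling `visible_credential_sources()` as the same user that runs the store process shows what that process can see. An empty list means none of these static sources is available, so credentials would have to come from somewhere else, such as the instance itself.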

@allada
Collaborator

allada commented Dec 1, 2023

Credentials can come from a variety of sources. On AWS they are likely coming from the service account attached to the instance, which is served from "169.254.169.254" (iirc). It is possible that the SDK is unable to resolve these credentials and use them. Doing what @aaronmondal said and setting them in ~/.aws/credentials to validate the problem would help (remember to set it for the same user that runs NativeLink [probably root]).

It is possible that the new S3 SDK does not support this credentials provider (I personally have not tested it yet). If that is the case, we can find a workaround.
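To see whether the instance metadata endpoint mentioned above actually exposes a role on a VM, something like the following sketch could be run on the instance. It assumes IMDSv2 (a session token is fetched first, then sent on the metadata request); the helper names `http_fetch` and `instance_role_names` are hypothetical, and the `fetch` parameter exists only so the logic can be exercised without network access.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def http_fetch(method, path, headers, timeout=2.0):
    """Perform one HTTP request against the instance metadata service."""
    req = urllib.request.Request(IMDS + path, method=method, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def instance_role_names(fetch=http_fetch):
    """Return the IAM role names the metadata service advertises, or [] if none.

    An empty list covers both "metadata service unreachable" and
    "no role attached to the instance".
    """
    try:
        # IMDSv2: obtain a short-lived session token first.
        token = fetch(
            "PUT",
            "/latest/api/token",
            {"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        # List the role(s) for which temporary credentials are available.
        listing = fetch(
            "GET",
            "/latest/meta-data/iam/security-credentials/",
            {"X-aws-ec2-metadata-token": token},
        )
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return []
    return [line.strip() for line in listing.splitlines() if line.strip()]
```

If `instance_role_names()` returns a role name on the VM but the store still reports "no providers in chain provided credentials", the role is attached and the problem is more likely in how the SDK is configured to look for credentials.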

@mhz5
Contributor Author

mhz5 commented Dec 1, 2023

In my (working) case, I have credentials at ~/.aws/config and ~/.aws/credentials.

I deployed native-link on AWS via terraform apply, and ssh'd into the scheduler, worker, and cas VMs.
I don't believe I saw the ~/.aws directory on any of these VMs (although I can double check).
@aaronmondal should ~/.aws exist on these VMs?
I'm under the impression that the AWS SDK gets credentials from the IAM roles on these VMs.

@allada
Collaborator

allada commented Dec 8, 2023

Sorry for the delay on this. We are trying to get a release pinned and have some priority thrashing. This is a high priority to resolve and we'll double check everything soon.

@prestonvanloon

prestonvanloon commented Dec 11, 2023

Is there a workaround for this? I'm also facing the same issue

@allada
Collaborator

allada commented Dec 16, 2023

Yep, we identified that the default credentials provider does not support this and the sdk requires this be set manually. Fix is in-flight, but waiting on a regression test.

allada added a commit that referenced this issue Dec 19, 2023
S3 store is too unstable to consider it production worthy.

In the mean time we will mark it experimental until we get the bugs
fixed.

related: #491
related: #460
related: #428
adam-singer added a commit that referenced this issue Dec 21, 2023
Credential provider is not supplied when creating aws_config::from_env(), that leads to failure responses seen in #428.

Pass the aws_config::default_provider::credentials::default_provider into the aws_config::from_env() builder, which should pick up the proper credentials for the environment based on the resolution order.
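The "resolution order" idea behind this fix can be illustrated with a small sketch. This is not the SDK's implementation (the real chain is written in Rust and includes more providers); the names `resolve` and `CredentialsNotLoaded` and the provider labels are chosen here for illustration only. Each provider is tried in order, the first one that yields credentials wins, and if none does, the error seen in this issue is raised.

```python
from typing import Callable, Optional, Sequence, Tuple

# (access_key_id, secret_access_key); real credentials carry more fields.
Credentials = Tuple[str, str]
Provider = Callable[[], Optional[Credentials]]


class CredentialsNotLoaded(Exception):
    """Raised when every provider in the chain declines to provide credentials."""


def resolve(chain: Sequence[Tuple[str, Provider]]) -> Tuple[str, Credentials]:
    """Try each (name, provider) pair in order and return the first hit."""
    for name, provider in chain:
        creds = provider()
        if creds is not None:
            return name, creds
    # Nothing in the chain produced credentials: the failure mode from this issue.
    raise CredentialsNotLoaded("no providers in chain provided credentials")
```

With a chain such as `[("environment", env_provider), ("profile", profile_provider), ("imds", imds_provider)]`, a VM that has neither environment variables nor a `~/.aws` directory would fall through to the instance role, provided an instance-metadata provider is in the chain at all.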
@aaronmondal
Contributor

Should be fixed by #494. This change is published in the v0.2.0 release. Containers with this fix are:

docker pull ghcr.io/tracemachina/nativelink:v0.2.0

Please let us know if things still don't work.
