HeadObjectError in S3: no providers in chain provided credentials #428

mhz5 · 2023-11-28T17:40:42Z

It might be worth having CI perform a dev native-link deployment in AWS and point some builds at it (nightly?) to prevent regressions in the AWS deployment.

Deployed native-link on AWS (instructions) at 4cc53bc and attempted to point a build at the deployment. Encountered the following error:

bazel test //:dummy_test \
    --remote_cache=grpcs://cas.DOMAIN \
    --remote_executor=grpcs://scheduler.DOMAIN \
    --remote_instance_name=main

INFO: Invocation ID: d4859976-5154-434c-91f9-3537d8ef7d40
INFO: Analyzed target //:dummy_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
ERROR: /home/ubuntu/native-link/BUILD.bazel:42:8: Executing genrule //:dummy_test_sh failed: (Exit 34): UNAVAILABLE: Unhandled HeadObjectError in S3: Unhandled { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: CredentialsNotLoaded(CredentialsNotLoaded { source: "no providers in chain provided credentials" }), connection: Unknown } }), meta: ErrorMetadata { code: None, message: None, extras: None } }, retries: 7 : Failed to run has() on slow store : Inner store get in compression store failed : Compression underlying store get failed : --- : Received erroneous partial chunk: Error { code: Internal, messages: ["Writer was dropped before EOF was sent"] } : During first buf_channel::take() : Failed to read header in get_part compression store : Failed to get_part in get_part_unchunked : --- : Received erroneous partial chunk: Error { code: Internal, messages: ["Writer was dropped before EOF was sent"] } : Failed to recv first chunk in collect_all_with_size_hint : Failed to read stream to completion in get_part_unchunked
INFO: Elapsed time: 4.626s, Critical Path: 3.88s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
//:dummy_test FAILED TO BUILD

Executed 0 out of 1 test: 1 fails to build.


mhz5 · 2023-11-28T17:44:26Z

@allada @aaronmondal
If you have a minute, any pointers on diagnosing this?
If not I'll just poke around on my own time.

mhz5 · 2023-11-28T18:57:33Z

Succeeds at 56eda36

bazel test //:dummy_test     --remote_cache=grpcs://cas.DOMAIN     --remote_executor=grpcs://scheduler.DOMAIN     --remote_instance_name=main

aaronmondal · 2023-11-29T08:19:37Z

@mhz5 Is this still an issue after #423? @allada added some additional config options in #421 for the endpoint URL, though I'm not sure whether that's actually related. Seems like it can't find the AWS credentials.

Note to self: It seems like we can now also add support for SSO: awslabs/aws-sdk-rust#703.

allada · 2023-11-29T16:46:52Z

Yeah, can you tell us what checkout you are using? It may have been fixed in recent changes.

We have not had time to deep-dive the AWS config again to verify everything, because we are in the process of releasing a GCP variation and a pinned version release.

We will be making more changes to AWS config in the coming months to support rule-based auto-scaling policies, so be on the lookout for those too.

mhz5 · 2023-11-29T18:40:00Z

> @mhz5 Is this still an issue after #423 ?

Yes, this occurs even at #423.
I updated the description to include the deployed commit.
Thanks for taking a look guys, I will dig deeper.
Looking forward to the pinned release!

mhz5 · 2023-11-30T01:08:37Z

This regression occurred with this PR:
#369

This code section

aaronmondal · 2023-12-01T12:10:12Z

Hmm I just played around with things and can't reproduce this issue. I can imagine two things going wrong here:

The new S3 implementation enforces HTTPS by default and now requires explicitly setting the insecure_allow_http flag for HTTP traffic. The error message if this is not set looks something like this:

[2023-12-01T11:53:19.521Z WARN  aws_smithy_runtime::client::http::hyper_014] unrecognized error from Hyper. If this error should be retried, please file an issue. err=error trying to connect: Error { code: Unavailable, messages: ["Failed to call S3 connector: Custom { kind: Other, error: \"Unsupported scheme http\" }, retries: 7"] }: Error { code: Unavailable, messages: ["Failed to call S3 connector: Custom { kind: Other, error: \"Unsupported scheme http\" }, retries: 7"] } (hyper::Error(Connect, Error { code: Unavailable, messages: ["Failed to call S3 connector: Custom { kind: Other, error: \"Unsupported scheme http\" }, retries: 7"] }))

In this case the fix would be to set that flag to true in the config.json.
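For illustration, the HTTPS-by-default behavior described above can be modeled roughly as follows. This is a minimal sketch, not the store's actual code: `check_endpoint` is a hypothetical helper, and only the flag name `insecure_allow_http` and the "Unsupported scheme http" message are taken from this thread.

```rust
/// Hypothetical helper (not part of native-link): decide whether an S3
/// endpoint URL is acceptable. `insecure_allow_http` mirrors the config
/// flag named in this thread.
fn check_endpoint(url: &str, insecure_allow_http: bool) -> Result<(), String> {
    // Split "scheme://rest" and normalize the scheme to lowercase.
    let scheme = url
        .split_once("://")
        .map(|(scheme, _rest)| scheme.to_ascii_lowercase())
        .ok_or_else(|| format!("Missing scheme in {url}"))?;

    match scheme.as_str() {
        // HTTPS is always accepted.
        "https" => Ok(()),
        // Plain HTTP is accepted only when the flag is set explicitly.
        "http" if insecure_allow_http => Ok(()),
        // Everything else is rejected.
        other => Err(format!("Unsupported scheme {other}")),
    }
}

fn main() {
    // The URLs below are placeholders chosen for the example.
    for (url, allow) in [
        ("https://s3.example.com", false),
        ("http://localhost:9000", false),
        ("http://localhost:9000", true),
    ] {
        println!(
            "{url} (insecure_allow_http={allow}): {:?}",
            check_endpoint(url, allow)
        );
    }
}
```

With the flag left at false, the plain-HTTP endpoint is rejected, which is the situation the log above shows.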

Otherwise it might just be a difference in the way you pass the credentials to the store. The AWS SDK uses a different config detection mechanism than the previous implementation. In my (working) case, I have credentials at ~/.aws/config and ~/.aws/credentials.

I assumed that https://github.com/TraceMachina/native-link/blob/3ec203b9c17e8e4dfa7160f74e948c64e542de16/native-link-store/src/s3_store.rs#L152C11-L152C11 correctly forwards environment variables to the S3 store if they're set. If this is not the case we might need to make the detection of the credential environment variables explicit.
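To check which of the static credential sources mentioned here are visible to a given user, something like the following could be run on the affected machine. It is a rough diagnostic sketch using only the standard library: `visible_sources` is a hypothetical helper, the `/root` fallback is an assumption, and the check only covers the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` variables and the two files under `~/.aws`. It does not query instance roles.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: list the static credential sources that look
/// available. `env` abstracts environment lookup so the logic is testable.
fn visible_sources(env: impl Fn(&str) -> Option<String>, home: &Path) -> Vec<String> {
    let mut found = Vec::new();

    // Static keys from the environment need both variables to be set.
    if env("AWS_ACCESS_KEY_ID").is_some() && env("AWS_SECRET_ACCESS_KEY").is_some() {
        found.push("environment variables".to_string());
    }

    // Shared files in the home directory of the user given by `home`.
    for name in ["credentials", "config"] {
        let path: PathBuf = home.join(".aws").join(name);
        if path.is_file() {
            found.push(path.display().to_string());
        }
    }

    found
}

fn main() {
    // Assumption: fall back to /root when HOME is unset.
    let home = std::env::var("HOME").unwrap_or_else(|_| "/root".to_string());
    let found = visible_sources(|key| std::env::var(key).ok(), Path::new(&home));

    if found.is_empty() {
        println!("no static credentials visible; another source (such as an instance role) would be needed");
    } else {
        for source in found {
            println!("found: {source}");
        }
    }
}
```

Run it as the same user that runs the service, since both the environment and the home directory differ per user.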

allada · 2023-12-01T15:58:28Z

Credentials can come from a variety of sources. On AWS they likely come from the service account attached to the instance, which is served from "169.254.169.254" (iirc). It is possible that it is unable to resolve these credentials and use them. Doing what @aaronmondal said and setting them in ~/.aws/credentials to validate the problem would help (remember to set them for the same user that runs NativeLink [probably root]).

It is possible that the new S3 SDK does not support this credentials provider (I personally have not tested it yet). If that is the case, we can find a work-around.

mhz5 · 2023-12-01T23:10:41Z

> In my (working) case, I have credentials at ~/.aws/config/~/.aws/credentials.

I deployed native-link on AWS via terraform apply, and ssh'd into the scheduler, worker, and cas VMs.
I don't believe I saw the ~/.aws directory on any of these VMs (although I can double check).
@aaronmondal should ~/.aws exist on these VMs?
I'm under the impression that the AWS SDK gets credentials from the IAM roles on these VMs.

allada · 2023-12-08T16:05:34Z

Sorry for the delay on this. We are trying to get a release pinned and have some priority thrashing. This is a high priority to resolve and we'll double check everything soon.

prestonvanloon · 2023-12-11T21:09:29Z

Is there a workaround for this? I'm also facing the same issue

allada · 2023-12-16T03:04:26Z

Yep, we identified that the default credentials provider does not support this and the SDK requires this to be set manually. Fix is in-flight, but waiting on a regression test.

S3 store is too unstable to consider it production-worthy. In the meantime we will mark it experimental until we get the bugs fixed. related: #491 related: #460 related: #428

Credential provider is not supplied when creating aws_config::from_env(), which leads to the failure responses seen in #428. Pass aws_config::default_provider::credentials::default_provider into the aws_config::from_env() builder, which should pick up the proper credentials for the environment based on the resolution order.
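To make the failure mode concrete, here is a toy model of a provider chain with a fixed resolution order. It is a sketch under stated assumptions, not the AWS SDK's implementation: `Credentials`, `Provider`, `resolve`, `declines`, and `instance_role` are all hypothetical names, and only the error string is taken from the log in this issue. Real providers would read the environment, profile files, or the instance metadata endpoint.

```rust
/// Hypothetical stand-in for resolved credentials (not an SDK type).
#[derive(Debug, Clone, PartialEq)]
struct Credentials {
    access_key_id: String,
    source: &'static str,
}

/// A provider either yields credentials or declines by returning `None`.
type Provider = fn() -> Option<Credentials>;

/// Try each provider in order; the first one that answers wins.
fn resolve(chain: &[Provider]) -> Result<Credentials, String> {
    chain
        .iter()
        .find_map(|provider| provider())
        .ok_or_else(|| "no providers in chain provided credentials".to_string())
}

/// Models a source that has nothing to offer (for example, unset variables).
fn declines() -> Option<Credentials> {
    None
}

/// Models a source that answers; the key is a placeholder value.
fn instance_role() -> Option<Credentials> {
    Some(Credentials {
        access_key_id: "EXAMPLE".to_string(),
        source: "instance role",
    })
}

fn main() {
    // An empty chain models a client built without any usable provider.
    let empty: [Provider; 0] = [];
    println!("{:?}", resolve(&empty));

    // With a provider that can answer, resolution succeeds.
    let chain: [Provider; 2] = [declines, instance_role];
    println!("{:?}", resolve(&chain));
}
```

In this model the error appears whenever every provider in the chain declines, which matches the symptom in this issue. Per the description above, the actual fix supplies the default provider to the config builder.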

aaronmondal · 2023-12-21T23:02:42Z

Should be fixed by #494. This change is published in the v0.2.0 release. Containers with this fix are:

docker pull ghcr.io/tracemachina/nativelink:v0.2.0

Please let us know if things still don't work.

aaronmondal self-assigned this Nov 29, 2023

aaronmondal mentioned this issue Dec 8, 2023: Update aws libraries to 1.x, hyper to 1.x #460 (Open)

adam-singer mentioned this issue Dec 15, 2023: S3 Store Credential Provider #494 (Merged)

allada added a commit that referenced this issue Dec 19, 2023: [Breaking] Mark S3 store experimental (05a6dd7)

aaronmondal closed this as completed Dec 21, 2023
