
Retrieving key vault secret takes 1m 40s with default credentials #23017

Closed
rudfoss opened this issue Aug 26, 2022 · 22 comments
Labels: Azure.Identity, Client, customer-reported, KeyVault, question

Comments

@rudfoss

rudfoss commented Aug 26, 2022

  • Package Name: @azure/keyvault-secrets
  • Package Version: 4.5.1
  • Azure CLI Version: 2.34.0
  • Azure PowerShell Version: 7.2.1
  • Operating system: Windows 10
  • nodejs
    • version: 16.14.0
  • browser
    • name/version: n/a
  • typescript
    • version: 4.8.2
  • Is the bug related to documentation in

Describe the bug

To Reproduce
Steps to reproduce the behavior:
Repository where I've reproduced the problem: https://github.com/rudfoss/keyvault-secret-test

Expected behavior
I'm seeing some strange behavior when retrieving key vault secrets using @azure/keyvault-secrets. Currently it takes about 1m40s to request a single secret using DefaultAzureCredential. I'm signed in to the Azure CLI and Azure PowerShell. My resources are deployed to Norway East. Full repo that reproduces the problem here: https://github.com/rudfoss/keyvault-secret-test

Here is essentially what I'm trying to do:

import { DefaultAzureCredential } from "@azure/identity"
import { SecretClient } from "@azure/keyvault-secrets"

const cred = new DefaultAzureCredential()
const secretClient = new SecretClient(process.env["KEYVAULT_URI"]!, cred)

// This single call takes roughly 1m 40s
console.time("getSecret")
const secret = await secretClient.getSecret("test-secret")
console.timeEnd("getSecret")
console.log({ secret: secret.value })

The strange thing is that it's not slow when using the same credentials to retrieve an App Configuration setting like this:

import { DefaultAzureCredential } from "@azure/identity"
import { AppConfigurationClient } from "@azure/app-configuration"

const cred = new DefaultAzureCredential()
const configClient = new AppConfigurationClient(process.env["APPCONFIG_URI"]!, cred)

// Same credential type, but this call returns almost immediately
console.time("getConfig")
const configSetting = await configClient.getConfigurationSetting({ key: "test-value" })
console.timeEnd("getConfig")
console.log({ configSetting: configSetting.value })

Would love some pointers on why this is slow and if there is something I can do to fix it.

@ghost ghost added needs-triage This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Aug 26, 2022
@azure-sdk azure-sdk added Client This issue points to a problem in the data-plane of the library. KeyVault needs-team-triage This issue needs the team to triage. labels Aug 26, 2022
@ghost ghost removed the needs-triage This is a new issue that needs to be triaged to the appropriate team. label Aug 26, 2022
@xirzec
Member

xirzec commented Aug 26, 2022

Interesting! Is it any faster if you use either AzureCliCredential or AzurePowerShellCredential directly? Trying to figure out where the chain is getting held up.
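For example, something like this (a rough sketch, assuming the same KEYVAULT_URI and test-secret from your repro) would time each credential in isolation:

import { AzureCliCredential, AzurePowerShellCredential } from "@azure/identity"
import { SecretClient } from "@azure/keyvault-secrets"

// Time the same getSecret call with each credential on its own
for (const [name, cred] of [
  ["AzureCliCredential", new AzureCliCredential()],
  ["AzurePowerShellCredential", new AzurePowerShellCredential()],
] as const) {
  const client = new SecretClient(process.env["KEYVAULT_URI"]!, cred)
  console.time(name)
  await client.getSecret("test-secret")
  console.timeEnd(name)
}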

@rudfoss
Author

rudfoss commented Aug 26, 2022

AzureCliCredential is pretty fast (about 4 seconds). I'd guess it's because I'm connecting to a Key Vault in East US from Norway.
AzurePowerShellCredential is about the same (5-ish seconds).

@rudfoss
Author

rudfoss commented Aug 26, 2022

I've added logs from one of the slow runs here: https://github.com/rudfoss/keyvault-secret-test/blob/main/logs.txt

@timovv
Member

timovv commented Aug 26, 2022

Hey @rudfoss, thanks for the logs and detailed repro!

What environment are you running this on? Is it a local machine or somewhere in Azure (a VM, Azure Cloud Shell...)? It appears that DefaultAzureCredential is trying to authenticate with Managed Identity first and is failing. I can't deduce the reason from the logs alone, but maybe the Managed Identity service principal doesn't have the correct permissions to retrieve the secret -- and even if you aren't intending to use Managed Identity, DefaultAzureCredential is trying anyway.

Your logs aren't timestamped, but it looks like there are a number of retries of the Managed Identity request. I wonder if the delay between retries is what's causing the slowness?
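If it helps, all of the SDKs log through @azure/logger, so a small sketch like this should produce timestamped verbose output on the next run:

import { AzureLogger, setLogLevel } from "@azure/logger"

// Turn on verbose SDK logging and prefix every log line with a timestamp
setLogLevel("verbose")
AzureLogger.log = (...args) => {
  console.log(new Date().toISOString(), ...args)
}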

@rudfoss
Author

rudfoss commented Aug 26, 2022

Hi @timovv. Thanks for the reply.

I'm running the code locally on my machine (Windows 10, Node 16), connecting to key vaults in Norway East and East US, both set up with RBAC auth and with my user assigned as Key Vault admin. I've installed the Azure CLI and Azure PowerShell and I'm signed in to both. I don't have the versions in front of me at the moment, but I'll add them tomorrow.

The authentication eventually works and the code retrieves the secret value. What is strange is that using the exact same or a new default credential with an App Configuration instance works instantly. There seems to be something specific to the key vault or keyvault-secrets library that I'm not understanding.

@timovv
Member

timovv commented Aug 26, 2022

What you're describing, with the credential working fine for App Configuration, does strike me as odd. Just to rule it out: when you make the fast call to getConfigurationSetting (or whatever App Configuration client method you're using), are you doing that after already making a call to secretClient.getSecret? If you're running them one after the other, it's possible that Identity is doing some caching. I'd also be curious whether subsequent calls to secretClient.getSecret are faster than the first call.
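For example, something like this (a sketch reusing the setup from your repro) would show whether the second call benefits from a cached token:

import { DefaultAzureCredential } from "@azure/identity"
import { SecretClient } from "@azure/keyvault-secrets"

const secretClient = new SecretClient(process.env["KEYVAULT_URI"]!, new DefaultAzureCredential())

// If Identity caches the token, the second call should be much faster than the first
console.time("getSecret #1")
await secretClient.getSecret("test-secret")
console.timeEnd("getSecret #1")

console.time("getSecret #2")
await secretClient.getSecret("test-secret")
console.timeEnd("getSecret #2")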

@rudfoss
Author

rudfoss commented Aug 27, 2022

Edit: I did a retest this morning, altering the order and instances.

Credentials | First | Second | Result
Shared DefaultAzureCredential | ✅ Config | 🔻 Secret | Config is fast, secret is slow (secret took about 2 mins)
Shared DefaultAzureCredential | 🔻 Secret | 🔻 Config | Both are equally slow (about 2 mins each)
Separate DefaultAzureCredential instances | ✅ Config | 🔻 Secret | Config is fast, secret is slow (secret took about 2 mins)
Separate DefaultAzureCredential instances | 🔻 Secret | ✅ Config | Secret is slow, config is fast (secret took about 2 mins)

So it appears there is some caching going on...

@rudfoss
Author

rudfoss commented Aug 27, 2022

I'm stepping through the client core code at the moment and it seems there are several requests to:

169.254.169.254:80

which eventually fail with connect ETIMEDOUT 169.254.169.254:80

A bit of googling turned up this:
Of particular note, 169.254.169.254 is used in AWS, Azure and other cloud computing platforms to host instance metadata service.

Does the client believe that I'm running this in an Azure VM? If so, how does it determine that for the Key Vault client but not the App Configuration client?

Edit: Still more debugging reveals the delay happens within the ManagedIdentityCredential. I've narrowed it down to getToken -> authenticateManagedIdentity -> cachedAvailableMSI. Still debugging...

Edit2: The cachedAvailableMSI call is pretty slow (~10 seconds), but the real delay comes from the availableMSI.getToken call just below it which comes from an imdsMsi constant object.
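For reference, a standalone sketch (plain Node, no SDK involved; the 3-second timeout is just an arbitrary value) shows the same behaviour against that address when run outside Azure:

import * as http from "node:http"

// 169.254.169.254 is the link-local IMDS address; outside Azure nothing answers,
// so the connection attempt just sits there until the socket timeout fires
const req = http.get("http://169.254.169.254/metadata/identity/oauth2/token", { timeout: 3000 })
req.on("timeout", () => {
  console.log("IMDS probe timed out")
  req.destroy()
})
req.on("error", (err) => console.error("IMDS probe error:", err.message))
req.on("response", (res) => console.log("unexpected response:", res.statusCode))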

@rudfoss
Author

rudfoss commented Aug 27, 2022

Ok, so I think I may have found at least one issue, and I also think it may be indirectly responsible for my problem.

The ManagedIdentityCredential eventually ends up calling isAvailable on this line:

On my machine this results in a request to http://169.254.169.254/metadata/identity/oauth2/token which eventually times out. Once it does the code proceeds to line 159 here:

The if statement at this point checks for several known errors but does not check for a timeout (ETIMEDOUT). The result is that isAvailable returns true on my machine (and, I suspect, in a lot of other cases) where whatever is at that IP responds slowly, if at all, and the request times out.

I don't understand why ETIMEDOUT should make this MSI return true for availability, but that is the case at this time.

In any case, the consequence is that ManagedIdentityCredential will attempt to getToken using imdsMsi.getToken, which requests it from the 169.254.169.254 host. This host does not reply and the request eventually times out. However, getToken is set up to retry 3 times, causing it to "hang" for roughly timeout × retries. I believe this is the delay I've been seeing. Eventually it gives up and tags the MSI implementation as unavailable, which causes it to fall back to the AzureCliCredential that works.

I've not yet uncovered why App Configuration behaves differently from Key Vault secrets, but I'll keep looking.

Edit:
I've traced calls back to the same point using the App Configuration client's getConfigurationSetting method. It ends up at the same line here:

However, this time the request within isAvailable is immediately aborted. Apparently @azure/app-configuration AppConfigurationClient does something to the configuration... still searching...

Edit2:
Ok, so the abort is a timeout of 300ms that is set by the isAvailable method here:

request.timeout = options.requestOptions?.timeout ?? 300;

This timeout is in effect for the App Configuration client because it tries to getToken before making a request. The SecretClient, however, seems to try to fetch the secret first, gets a 401 response and only THEN tries to getToken. And because a previous request is already in the pipeline, the timeout property from that request is used instead of the default 300 ms from the line above. A timeout of 0 seems to fall back to some internal default, which is long but not infinite, and that is then used when the request for http://169.254.169.254/metadata/identity/oauth2/token is sent.

I haven't really understood the internal implementation here so I don't know exactly why a timeout value would be reused between requests, but that seems to be what's going on here.
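The difference between the 300 ms default and the inherited 0 shows up if I call getToken directly with an explicit timeout (a sketch; the https://vault.azure.net/.default scope and the 1000 ms value are just illustrative):

import { DefaultAzureCredential } from "@azure/identity"

const cred = new DefaultAzureCredential()

// An explicit, non-zero timeout means the IMDS probe gives up quickly
// instead of inheriting the pipeline request's timeout of 0
console.time("getToken")
const token = await cred.getToken("https://vault.azure.net/.default", {
  requestOptions: { timeout: 1000 },
})
console.timeEnd("getToken")
console.log({ expiresOnTimestamp: token.expiresOnTimestamp })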

My short-term, extremely dirty fix for the issue is to duck-punch the getToken method of the ManagedIdentityCredential to remove the timeout parameter if it is 0.

With two small helper functions I can minimize the impact and make the credential work the same way for the Key Vault client and the App Configuration client.

import type { TokenCredential } from "@azure/core-auth"
import { ChainedTokenCredential, ManagedIdentityCredential } from "@azure/identity"

// Wrap getToken so an inherited timeout of 0 is stripped before the real call runs.
// Returns a function that restores the original getToken.
const resetTimeoutOnGetToken = (credential: TokenCredential) => {
  const original = credential.getToken
  credential.getToken = ((arg1: any, arg2: any, ...restArgs: any[]) => {
    if (arg2?.requestOptions?.timeout === 0) {
      delete arg2.requestOptions.timeout
    }
    return original.call(credential, arg1, arg2, ...restArgs)
  }) as any

  return () => {
    credential.getToken = original
  }
}

// Dig the ManagedIdentityCredential out of the chain (relies on the private _sources field)
const findManagedIdentityCredential = (chainedTokenCredential: ChainedTokenCredential): ManagedIdentityCredential | undefined => {
  return (chainedTokenCredential as any)["_sources"].find((credential: TokenCredential) => credential instanceof ManagedIdentityCredential)
}

The above functions allow me to do this before I pass the credential to a client:

import { DefaultAzureCredential } from "@azure/identity"

// Patch only the ManagedIdentityCredential inside the default chain
const cred: any = new DefaultAzureCredential()
const managedCred = findManagedIdentityCredential(cred)
if (managedCred) {
  resetTimeoutOnGetToken(managedCred)
}

It is in no way clean or correct, but it works for now. I'd love to know why the credential instances behave this way and why the timeout carries over, though... It seems a bit weird to me, but I'm sure there is a reason.
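A less invasive alternative for local development might be to skip ManagedIdentityCredential entirely and build a narrower chain by hand (a sketch, untested against my repro):

import {
  AzureCliCredential,
  AzurePowerShellCredential,
  ChainedTokenCredential,
  EnvironmentCredential,
} from "@azure/identity"
import { SecretClient } from "@azure/keyvault-secrets"

// Same idea as DefaultAzureCredential, but without ManagedIdentityCredential,
// so the IMDS probe (and its timeout) never happens on a dev machine
const localDevCredential = new ChainedTokenCredential(
  new EnvironmentCredential(),
  new AzureCliCredential(),
  new AzurePowerShellCredential()
)
const secretClient = new SecretClient(process.env["KEYVAULT_URI"]!, localDevCredential)

The obvious trade-off is that code deployed to Azure would still need DefaultAzureCredential (or ManagedIdentityCredential) to pick up a managed identity.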

@xirzec
Member

xirzec commented Aug 27, 2022

I'll dig more into this on Monday, but since you were so wonderful to have debugged this so extensively, I wanted to point out one special thing about Key Vault: namely that it uses challenge-based authentication:

* Key Vault supports other authentication schemes, but we ensure challenge authentication

This causes it to force receiving the challenge by sending a body-less request, which could explain the extra request you are seeing.
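Very roughly, the flow is (a simplified sketch of the idea, not the actual policy code; the header value is only an example):

// 1. The client first sends the request with no body and no Authorization header.
// 2. Key Vault answers 401 with a challenge saying where to authenticate and for which resource, e.g.:
const wwwAuthenticate =
  'Bearer authorization="https://login.microsoftonline.com/<tenant>", resource="https://vault.azure.net"'

// 3. The policy parses the challenge and only then calls credential.getToken(...) with the derived scope,
//    after which the original request is retried with the token attached.
const challenge: Record<string, string> = {}
for (const [, key, value] of wwwAuthenticate.matchAll(/(\w+)="([^"]*)"/g)) {
  challenge[key] = value
}
const scope = `${challenge.resource}/.default`
console.log({ challenge, scope })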

I'm also deeply suspicious of the fact that we're not handling ETIMEDOUT in that code path for IMDS availability.

@rudfoss
Author

rudfoss commented Aug 29, 2022

Thanks for the reply, @xirzec. I didn't mean for people to work weekends on this. I just found it interesting enough to tinker with on the side and wanted to document my findings. I'm perfectly happy to wait for a resolution here, especially now that I've got my hacky workaround. Let me know if I can assist with debugging this.

I'm curious why there needs to be a blacklist of errors here, though, instead of simply failing outright. I guess there is an edge case where an erroneous response indicates that the service is still there, though that seems odd.

Also, regarding the timeout, I'm curious why there is a default 300 ms timeout here that is overridable. I'd presume the IMDS credential knows what service it is calling and what the expected response time is, so it should "own" the timeout value itself. It is, after all, calling an endpoint that the code using the library has no influence over at all. I'm probably missing a lot of complexity here though...

In any case. Thanks a lot for helping me with this!

@xirzec
Member

xirzec commented Aug 29, 2022

> Thanks for the reply, @xirzec. I didn't mean for people to work weekends on this. I just found it interesting enough to tinker with on the side and wanted to document my findings. I'm perfectly happy to wait for a resolution here, especially now that I've got my hacky workaround. Let me know if I can assist with debugging this.

No worries! I was very curious about this one so I couldn't help but look at it again briefly on Saturday.

> I'm curious why there needs to be a blacklist of errors here, though, instead of simply failing outright. I guess there is an edge case where an erroneous response indicates that the service is still there, though that seems odd.

Yes, this feels extra odd since the way we are making this request means it shouldn't be throwing if it gets back any kind of valid HTTP response (even if that response is a 4xx, 5xx, etc.). I'm very tempted to remove these checks and treat any throw as a permanent failure.

> Also, regarding the timeout, I'm curious why there is a default 300 ms timeout here that is overridable. I'd presume the IMDS credential knows what service it is calling and what the expected response time is, so it should "own" the timeout value itself. It is, after all, calling an endpoint that the code using the library has no influence over at all. I'm probably missing a lot of complexity here though...

My best guess here is that it's highly unusual for a custom timeout to be passed to getToken, since it is usually called implicitly inside clients, so on the off chance that someone was calling getToken manually and passing a custom timeout we wanted to honor it.

However, I think things go a bit wrong here because the challenge-based auth policy is blindly copying over the timeout, which on the request always defaults to 0 when not set, and a 0 there does not mean that the consumer explicitly passed 0.
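The distinction I have in mind is roughly this (a sketch, not the actual policy code):

import type { GetTokenOptions } from "@azure/core-auth"

// request.timeout defaults to 0, meaning "no client-side timeout was set",
// so 0 should be treated as "not set" rather than forwarded as an explicit value
function toGetTokenOptions(requestTimeout: number): GetTokenOptions {
  return requestTimeout > 0 ? { requestOptions: { timeout: requestTimeout } } : {}
}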

@xirzec
Member

xirzec commented Aug 29, 2022

@rudfoss I have an attempt at a fix in #23035 if you want to take a look! A lot of the folks I want to review this are out this week, so it may have to ship in October if your workaround can hold that long. 😄

@xirzec xirzec removed the needs-team-triage This issue needs the team to triage. label Aug 29, 2022
@rudfoss
Author

rudfoss commented Aug 30, 2022

@xirzec Honestly, I expected this to be a problem in my code so having it addressed this quickly is great! I'll leave my workaround in place and remove it once a new release is available. The PR looks good to me at least :)

Thanks for looking at this so quickly! I really appreciate it.

@dhensby
Contributor

dhensby commented Sep 6, 2022

I have run into this problem locally too, and it looks like the cause has been correctly diagnosed in the linked PR. When I turn logging on I can see that the timeout for the "ping" to the IMDS endpoint is set to 0, meaning it takes a very long time for that "ping" to fail.

What is odd is that I'm finding this behaviour to be intermittent. Sometimes I'll get a response quite quickly (the "ping" will fail fast) and sometimes it takes a long time.

I also note that I see this in the logs:

azure:identity:info ManagedIdentityCredential - IMDS => ManagedIdentityCredential - IMDS: Using the Azure IMDS endpoint coming from the environment variable MSI_ENDPOINT=undefined, and using the cloud shell to proceed with the authentication.

Should the IMDS checks be skipped if no MSI_ENDPOINT env var is set?

@xirzec
Member

xirzec commented Sep 12, 2022

@dhensby that log statement seems very suspicious given that the imds flow doesn't seem to use MSI_ENDPOINT, but AZURE_POD_IDENTITY_AUTHORITY_HOST instead:

const url = new URL(imdsEndpointPath, process.env.AZURE_POD_IDENTITY_AUTHORITY_HOST ?? imdsHost);

Perhaps this has gotten a bit stale or was copied from another credential type?
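For what it's worth, that URL construction resolves like this (a sketch; the constant values are inferred from the http://169.254.169.254/metadata/identity/oauth2/token requests seen earlier in this thread):

// Assumed values based on the requests observed above
const imdsHost = "http://169.254.169.254"
const imdsEndpointPath = "/metadata/identity/oauth2/token"

const url = new URL(imdsEndpointPath, process.env.AZURE_POD_IDENTITY_AUTHORITY_HOST ?? imdsHost)
console.log(url.href)
// Without the env var:                        http://169.254.169.254/metadata/identity/oauth2/token
// With AZURE_POD_IDENTITY_AUTHORITY_HOST set: that host is used instead of 169.254.169.254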

@xirzec
Member

xirzec commented Sep 12, 2022

@dhensby I went ahead and tried updating the log line to be more helpful: e406b94

Appreciate any feedback you have!

@dhensby
Contributor

dhensby commented Sep 13, 2022

@xirzec ah, that makes sense and the update looks good. 👍

@aulonm

aulonm commented Sep 21, 2022

We had the same problem as rudfoss. Version 4.6.0 made it faster than 4.5.0, so we are happy and can finally update our dependency from 4.4.0.

Thanks for the fix!

@xirzec
Member

xirzec commented Sep 21, 2022

I should mention that 3.0.0 of @azure/identity is out now with the identity-side fixes too!

@adomrockie

The fix for me was a combination of updating @azure/identity and @azure/keyvault-secrets to the latest versions.

@rudfoss
Author

rudfoss commented Nov 10, 2022

I'm closing this now since a fix has been released. Thanks for the help!

@rudfoss rudfoss closed this as completed Nov 10, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023

7 participants