-
Notifications
You must be signed in to change notification settings - Fork 331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Large numbers of concurrent token refresh attempts cause a cache refresh convoy resulting in chronic 429 errors #4196
Comments
hi, |
@damiand2 - are you using Managed Identity via Azure.SDK or via MSAL ? |
indirectly via Microsoft.Data.SqlClient which uses Microsoft.Identity.Client - so i guess MSAL |
@damiand2 - can you provide a few more details about your system? @gladjohn - did the SQL folks integrate with MI via MSAL.NET ? I guess it doesn't matter, if they use Azure Identity, they still rely on MSAL for the token caching. |
@christothes How long was the retry wait in that scenario? |
If the client passes a cancellation shorter than the block time, the requests will timeout. So the client will still need to have a retry policy. If they don't have a retry policy, then requests will end with a cancellation exception instead 429. |
@bgavrilMS - standard .net core web app that uses MS SQL as storage. As for solution you described - depending on what you mean by 'block similar requests to the token issuer until one succeeds' - if by block you mean task block with request to DB then we would be replacing one kind of problem with another one that to external code (my code) differs in nothing - we would still get very long request handling times/timeouts once 24 hours (basically meaning app is unavailable for a minute or two every 24h). If i may propose something then it would be better to spin a single request for new MI token few minutes before old one expires and allow all standard calls (here to DB) to use old token and that request (once finished) should silently replace old MI token in cache. |
AFAIK, SQL Client only uses PCA from MSAL. And they seem to be using Azure.Identity for MI scenarios. |
We need both. When the cache is empty and a bunch of threads rush to fetch the MI token, they will throttle the token issuing service. But then, when a token is in the cache, MSAL should re-new it at 1/2 interval, as long as you keep requesting a token, in a background thread. The first is not implemented. The latter is - it was added in MSAL 4.54 (changelog here) Can you try to upgrade to use MSAL |
@bgavrilMS - just to make sure - when you wrote "Can you try to upgrade to use MSAL 5.54 or higher?" - you meant 4.54, right? |
ok, we will try to update MIcrosoft.Identity.Client package manually and see it if helps. I will let you know asap. |
Updating just MIcrosoft.Identity.Client package didn't help |
@bgavrilMS - hi again, we manually upgraded Microsoft.Identity.Client.Extensions.Msal to 4.56 in our project, but problem did not go away - we still see torrent of concurrent MI token requests when old one expires. |
I believe Azure SDK might need to do some integration work @gladjohn ? |
@christothes - do you have some test setup to validate / invalidate this fix? FYI - the fix was to add some semaphores https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/blob/main/src/client/Microsoft.Identity.Client/Internal/Requests/ClientCredentialRequest.cs#L120 |
@bgavrilMS - i could arrange some logs, you would just have to instruct me how to obtain them, i cannot set loglevel to Debug for every source, there would be too much data from production. |
MSAL does logging in its own way, see https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/wiki/logging (mostly because MSAL is low level library, it does not depend on any logging infra) |
@bgavrilMS - unfortunately we are not creating our own msal client instances, we depend on AzureAD implementation (via MIcrosoft.Data.SqlClient and providing in connection string entry like Authentication=Active Directory Managed Identity;User Id=xxxx-xxxxx-xxxx-xxxx;). Is there a way to enforce MSAL logging in such scenario? |
If you have flexibility to try a newer version of SQLClient, we added a new preview feature to SqlClient version 5.2.0-preview3 that gives you full control of the credential provider via a callback. |
Ah, MSAL is so many layers down that I am not sure if MSAL infrastructure for token caching is used here and if the fix applies by just taking direct dependency to MSAL. The callback approach might work - i.e. you just use MSAL directly (or Azure Identity) to get a token. |
@christothes - ok, i'm potentially willing to test out 5.2.0-preview3 - should i use that callback to enable more logging in msal client? Or do you have something else in mind? |
Which ever is easier for your scenario. Pure MSAL would be the best direct test, but you can also enable logging for Azure.Identiy (which enabled MSAL logging also) via https://github.com/Azure/azure-sdk-for-net/tree/main/sdk/identity/Azure.Identity#logging |
@christothes - unfortunately SqlClient 5.2.0-preview3 is quite unstable (many random connection resets when connecting to db) so we cannot use it on production. I have to wait until SqlClient gets official release and let you know, unless you got other ideas how to test it earlier? |
You may be able to achieve the same result with this: https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.sqlclient.sqlauthenticationprovider?view=sqlclient-dotnet-standard-5.1 |
We consider this fixed. |
Describe the bug
In a scenario in which an application does the following:
WithAppTokenProvider
with a callback configured to fetch tokens from a managed identity endpointGetToken
requests simultaneouslyMaxRetries
in the HttpClient pipeline used to fetch tokensWhen such an application encounters a 429 response from the MI endpoint, this can result in a storm of requests and retry requests making the 429 problem worse.
In addition, given the current behavior in MSAL for retries and cache access, all retries are guaranteed to result in a cache miss and will continue to fail as long as the MI endpoint does not return a successful token response.
Expected behavior
Retry attempts after the token cache is successfully refreshed should succeed via a cache hit rather than through a network request to the MI endpoint or authority. Only one request should be made to the endpoint to refresh the cache for any given cache entry and all other concurrent requests should consume that single result.
Actual behavior
Once the initial request fails with a retriable status code, all subsequent token requests do not attempt to read the cache and always result in an additional network request.
Reproduction Steps
Issue a large number of simultaneous
GetToken
requests with aManagedIdentityCredential
to induce a 429 response from the MI endpointEnvironment
Customer example was in Service Fabric, but this should reproduce in any managed identity environment in which a 429 response is possible.
The text was updated successfully, but these errors were encountered: