Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Large numbers of concurrent token refresh attempts cause a cache refresh convoy resulting in chronic 429 errors #4196

Closed
christothes opened this issue Jun 30, 2023 · 27 comments · Fixed by #4309

Comments

@christothes
Copy link

Describe the bug

In a scenario in which an application does the following:

  • utilizes a WithAppTokenProvider with a callback configured to fetch tokens from a managed identity endpoint
  • Issues hundreds of GetToken requests simultaneously
  • Optionally increases MaxRetries in the HttpClient pipeline used to fetch tokens

When such an application encounters a 429 response from the MI endpoint, this can result in a storm of requests and retry requests making the 429 problem worse.
In addition, given the current behavior in MSAL for retries and cache access, all retries are guaranteed to result in a cache miss and will continue to fail as long as the MI endpoint does not return a successful token response.

Expected behavior

Retry attempts after the token cache is successfully refreshed should succeed via a cache hit rather than through a network request to the MI endpoint or authority. Only one request should be made to the endpoint to refresh the cache for any given cache entry and all other concurrent requests should consume that single result.

Actual behavior

Once the initial request fails with a retriable status code, all subsequent token requests do not attempt to read the cache and always result in an additional network request.

Reproduction Steps

Issue a large number of simultaneous GetToken requests with a ManagedIdentityCredential to induce a 429 response from the MI endpoint

Environment

Customer example was in Service Fabric, but this should reproduce in any managed identity environment in which a 429 response is possible.

@damiand2
Copy link

hi,
any ETA? We are heavily impacted by this.

@bgavrilMS
Copy link
Member

@damiand2 - are you using Managed Identity via Azure.SDK or via MSAL ?

@damiand2
Copy link

damiand2 commented Aug 3, 2023

indirectly via Microsoft.Data.SqlClient which uses Microsoft.Identity.Client - so i guess MSAL

@bgavrilMS
Copy link
Member

bgavrilMS commented Aug 3, 2023

@damiand2 - can you provide a few more details about your system?
The solution we're thinking of is to block similar requests to the token issuer until one succeeds, then all others can just pick the token from MSAL's cache. This will only work if the contention is at thread level. Will it help in your scenario, i.e. do you have a bunch of threads racing to get tokens or do you have a bunch of processes?

@gladjohn - did the SQL folks integrate with MI via MSAL.NET ? I guess it doesn't matter, if they use Azure Identity, they still rely on MSAL for the token caching.

@pmaytak
Copy link
Contributor

pmaytak commented Aug 4, 2023

@christothes How long was the retry wait in that scenario?
Wouldn't a retry with a sufficiently longer wait time than however long it takes to get a token from MI work? The subsequent requests will wait in a retry while the first request gets and caches a token, then following requests will hit the cache.

@pmaytak
Copy link
Contributor

pmaytak commented Aug 4, 2023

The solution we're thinking of is to block similar requests to the token issuer until one succeeds

If the client passes a cancellation shorter than the block time, the requests will timeout. So the client will still need to have a retry policy. If they don't have a retry policy, then requests will end with a cancellation exception instead 429.

@damiand2
Copy link

damiand2 commented Aug 4, 2023

@bgavrilMS - standard .net core web app that uses MS SQL as storage. As for solution you described - depending on what you mean by 'block similar requests to the token issuer until one succeeds' - if by block you mean task block with request to DB then we would be replacing one kind of problem with another one that to external code (my code) differs in nothing - we would still get very long request handling times/timeouts once 24 hours (basically meaning app is unavailable for a minute or two every 24h). If i may propose something then it would be better to spin a single request for new MI token few minutes before old one expires and allow all standard calls (here to DB) to use old token and that request (once finished) should silently replace old MI token in cache.

@gladjohn
Copy link
Contributor

gladjohn commented Aug 4, 2023

@gladjohn - did the SQL folks integrate with MI via MSAL.NET ? I guess it doesn't matter, if they use Azure Identity, they still rely on MSAL for the token caching.

AFAIK, SQL Client only uses PCA from MSAL. And they seem to be using Azure.Identity for MI scenarios.

@bgavrilMS
Copy link
Member

bgavrilMS commented Aug 7, 2023

@bgavrilMS - standard .net core web app that uses MS SQL as storage. As for solution you described - depending on what you mean by 'block similar requests to the token issuer until one succeeds' - if by block you mean task block with request to DB then we would be replacing one kind of problem with another one that to external code (my code) differs in nothing - we would still get very long request handling times/timeouts once 24 hours (basically meaning app is unavailable for a minute or two every 24h). If i may propose something then it would be better to spin a single request for new MI token few minutes before old one expires and allow all standard calls (here to DB) to use old token and that request (once finished) should silently replace old MI token in cache.

We need both. When the cache is empty and a bunch of threads rush to fetch the MI token, they will throttle the token issuing service. But then, when a token is in the cache, MSAL should re-new it at 1/2 interval, as long as you keep requesting a token, in a background thread. The first is not implemented. The latter is - it was added in MSAL 4.54 (changelog here)

Can you try to upgrade to use MSAL 5.54 4.54.0 or higher? Either update to the latest Azure Identity (the beta version), or directly reference MSAL.

https://www.nuget.org/packages/Azure.Identity/1.10.0-beta.1

@damiand2
Copy link

damiand2 commented Aug 7, 2023

@bgavrilMS - just to make sure - when you wrote "Can you try to upgrade to use MSAL 5.54 or higher?" - you meant 4.54, right?

@damiand2
Copy link

damiand2 commented Aug 9, 2023

ok, we will try to update MIcrosoft.Identity.Client package manually and see it if helps. I will let you know asap.

@damiand2
Copy link

Updating just MIcrosoft.Identity.Client package didn't help

@bgavrilMS
Copy link
Member

Thanks @damiand2 for the update. @gladjohn is actively working on this and can share a preview MSAL package if you are willing to test it out.

@damiand2
Copy link

@bgavrilMS - hi again, we manually upgraded Microsoft.Identity.Client.Extensions.Msal to 4.56 in our project, but problem did not go away - we still see torrent of concurrent MI token requests when old one expires.

@bgavrilMS bgavrilMS reopened this Sep 19, 2023
@bgavrilMS
Copy link
Member

I believe Azure SDK might need to do some integration work @gladjohn ?

@bgavrilMS bgavrilMS added external and removed epic labels Sep 19, 2023
@christothes
Copy link
Author

I believe Azure SDK might need to do some integration work @gladjohn ?

It sounds like @damiand2 manually referenced the new MSAL which should have lifted up the dependency.

@bgavrilMS
Copy link
Member

@christothes - do you have some test setup to validate / invalidate this fix?
@damiand2 - is it possible to get some MSAL logs?

FYI - the fix was to add some semaphores https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/blob/main/src/client/Microsoft.Identity.Client/Internal/Requests/ClientCredentialRequest.cs#L120

@damiand2
Copy link

@bgavrilMS - i could arrange some logs, you would just have to instruct me how to obtain them, i cannot set loglevel to Debug for every source, there would be too much data from production.

@bgavrilMS
Copy link
Member

MSAL does logging in its own way, see https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/wiki/logging

(mostly because MSAL is low level library, it does not depend on any logging infra)

@damiand2
Copy link

@bgavrilMS - unfortunately we are not creating our own msal client instances, we depend on AzureAD implementation (via MIcrosoft.Data.SqlClient and providing in connection string entry like Authentication=Active Directory Managed Identity;User Id=xxxx-xxxxx-xxxx-xxxx;). Is there a way to enforce MSAL logging in such scenario?

@christothes
Copy link
Author

@bgavrilMS - unfortunately we are not creating our own msal client instances, we depend on AzureAD implementation (via MIcrosoft.Data.SqlClient and providing in connection string entry like Authentication=Active Directory Managed Identity;User Id=xxxx-xxxxx-xxxx-xxxx;). Is there a way to enforce MSAL logging in such scenario?

If you have flexibility to try a newer version of SQLClient, we added a new preview feature to SqlClient version 5.2.0-preview3 that gives you full control of the credential provider via a callback.

@bgavrilMS
Copy link
Member

Ah, MSAL is so many layers down that I am not sure if MSAL infrastructure for token caching is used here and if the fix applies by just taking direct dependency to MSAL.

The callback approach might work - i.e. you just use MSAL directly (or Azure Identity) to get a token.

@damiand2
Copy link

@christothes - ok, i'm potentially willing to test out 5.2.0-preview3 - should i use that callback to enable more logging in msal client? Or do you have something else in mind?

@christothes
Copy link
Author

@christothes - ok, i'm potentially willing to test out 5.2.0-preview3 - should i use that callback to enable more logging in msal client? Or do you have something else in mind?

Which ever is easier for your scenario. Pure MSAL would be the best direct test, but you can also enable logging for Azure.Identiy (which enabled MSAL logging also) via https://github.com/Azure/azure-sdk-for-net/tree/main/sdk/identity/Azure.Identity#logging

@damiand2
Copy link

@christothes - unfortunately SqlClient 5.2.0-preview3 is quite unstable (many random connection resets when connecting to db) so we cannot use it on production. I have to wait until SqlClient gets official release and let you know, unless you got other ideas how to test it earlier?

@christothes
Copy link
Author

@bgavrilMS
Copy link
Member

We consider this fixed.

@bgavrilMS bgavrilMS closed this as not planned Won't fix, can't repro, duplicate, stale Nov 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment