[BUG] Increasing DefaultAzureCredentialOptions MaxRetries dramatically slows down startup of app #33124

Closed
ddelapasseOII opened this issue Dec 20, 2022 · 6 comments
Labels
Azure.Identity Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-author-feedback More information is needed from author to address the issue. no-recent-activity There has been no recent activity on this issue. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@ddelapasseOII

Library name and version

Azure.Identity 1.7.0

Describe the bug

We've seen our Azure App Service crash twice now (about 30 days apart), and the only error message (pasted below) says that ManagedIdentityCredential authentication failed: retry failed after 3 tries.

I updated the code in our DefaultAzureCredential (see below) updating the MaxRetries to 6 (from 3) and the time our app took to get to the OnModelCreating surged from 38 to 54 seconds. With 9 MaxRetries the startup is 103 seconds. These numbers are pretty consistent. There were no key vault errors and this happens consistently every time since I updated the parameters. If I decrease the parameters the time goes back down.

But since my KeyVault IS accessible why would changing the MaxRetries make any difference?

        var isDeployed = !string.IsNullOrEmpty(Environment.GetEnvironmentVariable("WEBSITE_SITE_NAME"));
        return new DefaultAzureCredential(
            new DefaultAzureCredentialOptions
            {
                // Prevent deployed instances from trying things that don't work and generally take too long
                ExcludeInteractiveBrowserCredential = isDeployed,
                ExcludeVisualStudioCodeCredential = isDeployed,
                ExcludeVisualStudioCredential = isDeployed,
                ExcludeSharedTokenCacheCredential = isDeployed,
                ExcludeAzureCliCredential = isDeployed,
                ExcludeManagedIdentityCredential = false,
                Retry =
                {
                    MaxRetries = 6,   //was 3
                    NetworkTimeout = TimeSpan.FromSeconds(10),
                    MaxDelay = TimeSpan.FromSeconds(10)
                },

                // this helps devs use the right tenant
                InteractiveBrowserTenantId = defaultTenantId,
                SharedTokenCacheTenantId = defaultTenantId,
                VisualStudioCodeTenantId = defaultTenantId,
                VisualStudioTenantId = defaultTenantId
            }
        );

========== Message from Azure App Service when our app crashed today (this is why I'm trying to up the retries)

Application: w3wp.exe
CoreCLR Version: 6.0.1122.52304
.NET Version: 6.0.11
Description: The process was terminated due to an unhandled exception.
Exception Info: Azure.Identity.AuthenticationFailedException: ManagedIdentityCredential authentication failed: Retry failed after 3 tries. Retry settings can be adjusted in ClientOptions.Retry. (The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.) (The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.) (The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.)
See the troubleshooting guide for more information. https://aka.ms/azsdk/net/identity/managedidentitycredential/troubleshoot
---> System.AggregateException: Retry failed after 3 tries. Retry settings can be adjusted in ClientOptions.Retry. (The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.) (The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.) (The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.)
---> System.Threading.Tasks.TaskCanceledException: The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.
---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request..
---> System.Net.Sockets.SocketException (995): The I/O operation has been aborted because of either a thread exit or an application request.
--- End of inner exception stack trace ---

Expected behavior

Updating MaxRetries from 3 to 6 in the code below shouldn't cause any delay UNLESS the key vault is unreachable.

        return new DefaultAzureCredential(
            new DefaultAzureCredentialOptions
            {
                // Prevent deployed instances from trying things that don't work and generally take too long
                ExcludeInteractiveBrowserCredential = isDeployed,
                ExcludeVisualStudioCodeCredential = isDeployed,
                ExcludeVisualStudioCredential = isDeployed,
                ExcludeSharedTokenCacheCredential = isDeployed,
                ExcludeAzureCliCredential = isDeployed,
                ExcludeManagedIdentityCredential = false,
                Retry =
                {
                    MaxRetries = 6,
                    NetworkTimeout = TimeSpan.FromSeconds(10),
                    MaxDelay = TimeSpan.FromSeconds(10)
                },
                ...

Actual behavior

Our app startup time went from 36 to 54 seconds increasing MaxRetries from 3 to 6. For 9 retries startup time was > 100 seconds.

Reproduction Steps

Adjusted the MaxRetries as described above and observed startup timing.

Environment

Currently running locally on Win 10 laptop with 32GB memory, VS 2022 (64bit) v 17.2.2

.NET SDK (reflecting any global.json):
Version: 6.0.300
Commit: 8473146e7d

Runtime Environment:
OS Name: Windows
OS Version: 10.0.19044
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\6.0.300\

global.json file:
Not found

Host:
Version: 6.0.9
Architecture: x64
Commit: 163a63591c

.NET SDKs installed:
2.2.110 [C:\Program Files\dotnet\sdk]
3.1.301 [C:\Program Files\dotnet\sdk]
5.0.407 [C:\Program Files\dotnet\sdk]
6.0.300 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
Microsoft.AspNetCore.All 2.1.30 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.2.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.30 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.2.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.24 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.25 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 6.0.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.1.30 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.2.8 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.24 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.25 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.16 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 6.0.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 6.0.9 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.WindowsDesktop.App 3.1.24 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 3.1.25 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.16 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 6.0.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

@ghost ghost added needs-triage This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Dec 20, 2022
@jsquire jsquire added Client This issue points to a problem in the data-plane of the library. Azure.Identity needs-team-attention This issue needs attention from Azure service team or SDK team labels Dec 20, 2022
@ghost ghost removed the needs-triage This is a new issue that needs to be triaged to the appropriate team. label Dec 20, 2022
@jsquire
Member

jsquire commented Dec 20, 2022

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@jsquire
Member

jsquire commented Dec 20, 2022

//cc: @schaabs

@jsquire
Member

jsquire commented Dec 20, 2022

Because of the US holiday season, please expect delayed responses. There are two important things that I notice about your snippet and the error message:

  • The timeout that you're seeing is from DefaultAzureCredential attempting to talk to the managed identity endpoint.
Azure.Identity.AuthenticationFailedException: ManagedIdentityCredential authentication failed

The DefaultAzureCredential is a chained token source, meaning that it will flow through multiple credential types. If you are not intending to use managed identity for authorization, then I'd suggest disabling it by setting the ExcludeManagedIdentityCredential option.

If you do intend to use managed identity authorization, I'd suggest referencing the troubleshooting guide provided in the error message to help investigate.

  • Your snippet overrides the default 100 second network timeout and allows only 10 seconds (10% of the recommended value). Your stack trace consistently indicates that your requests to the managed identity endpoint are timing out and being cancelled as a result:
 The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.

If you are attempting to use managed identity authorization, this would seem to indicate that you are not giving the credential enough time to authenticate with your host's MI endpoint.

@ddelapasseOII
Author

Hi @jsquire, we do require the ManagedIdentityCredential, so excluding it is not an option. My main concern is WHY modifying the retry count/timeout would make so much difference when running locally (where managed identity is perfectly accessible). Nothing is failing, yet merely modifying MaxRetries makes a large difference during startup. I did remove our modifications and will let them default for now to see if that prevents the (very) intermittent server crashes, but I would love an explanation of how increasing MaxRetries has such an impact, if possible.

@jsquire
Member

jsquire commented Dec 22, 2022

Can you help me understand your statement that "managed identity is perfectly accessible"? The error message and stack trace clearly indicate that requests to the managed identity endpoint are timing out:

 The operation was cancelled because it exceeded the configured timeout of 0:00:05. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.

It appears that the MI endpoint is intermittently not responding quickly enough to meet your timeout value. As mentioned, you're overriding the default and allowing about 10% of the recommended value.
Because this is considered a transient exception, it will be retried. Because this is retried, adding more retries will cause more requests and include a back-off time between them, increasing the time. However, because these requests still do not have sufficient time to interact with your MI endpoint, they'll continue to fail.
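
To make the retry arithmetic concrete, here is a rough sketch of the worst-case wait when every attempt times out. It assumes plain exponential back-off (the delay doubles before each retry and is capped at MaxDelay) starting from a 0.8 second base delay, with no jitter. `WorstCase` is a hypothetical helper, not part of Azure.Identity, and the library's actual schedule may differ.

```csharp
using System;

// Hypothetical helper (not an Azure SDK API): upper bound on the time spent
// when every attempt runs into the network timeout.
static TimeSpan WorstCase(int maxRetries, TimeSpan networkTimeout, TimeSpan baseDelay, TimeSpan maxDelay)
{
    // One initial attempt plus maxRetries retries, each allowed to run for networkTimeout.
    TimeSpan total = TimeSpan.FromTicks(networkTimeout.Ticks * (maxRetries + 1));

    // Assumed back-off before each retry: baseDelay * 2^retry, capped at maxDelay, no jitter.
    for (int retry = 0; retry < maxRetries; retry++)
    {
        double seconds = Math.Min(baseDelay.TotalSeconds * Math.Pow(2, retry), maxDelay.TotalSeconds);
        total += TimeSpan.FromSeconds(seconds);
    }

    return total;
}

// Settings from the snippet in this issue: 10 s network timeout, 10 s max delay.
// The 0.8 s base delay is an assumption made for illustration.
foreach (int retries in new[] { 3, 6, 9 })
{
    TimeSpan bound = WorstCase(retries, TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(0.8), TimeSpan.FromSeconds(10));
    Console.WriteLine($"MaxRetries = {retries}: up to {bound.TotalSeconds:F1} s if every attempt times out");
}
```

Under these assumptions the bound grows with every added retry, but only attempts that actually time out contribute to it; requests that succeed on the first try are unaffected by MaxRetries.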

I'd suggest removing your override to the network timeout and testing with the default value of 100 seconds to see if that alleviates the issue.
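
A minimal sketch of that suggestion, assuming the remaining options stay as in the original snippet and that the Azure.Identity package is referenced: MaxRetries can still be raised, while NetworkTimeout is simply left unset so the default applies.

```csharp
using Azure.Identity;

var credential = new DefaultAzureCredential(
    new DefaultAzureCredentialOptions
    {
        ExcludeManagedIdentityCredential = false,
        Retry =
        {
            // NetworkTimeout is intentionally not set here, so the default
            // (100 seconds, per the discussion in this thread) applies.
            MaxRetries = 6
        }
    });
```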

@jsquire jsquire added needs-author-feedback More information is needed from author to address the issue. and removed needs-team-attention This issue needs attention from Azure service team or SDK team labels Dec 22, 2022
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Dec 31, 2022
@ghost

ghost commented Dec 31, 2022

Hi, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@ghost ghost closed this as completed Jan 15, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2023
This issue was closed.