Periodic Cloud SQL IAM service account authentication failed errors #800
A couple of questions:
As for suggestions:
When using
They are periodic jobs that run every 10 min, for ~2-3 seconds.
Upon receiving a termination signal, a pod usually races to shut down and complete any remaining work. Even though we handle the context gracefully, there's always a chance a new job might start just before the server is signalled to shut down. We won't initiate any new jobs, but some may be on their "last run" and require a database connection. If it just so happens that the token is expired (and a refresh is necessary), why would this fail?
I think this is the root of the problem: I presume that when a pod is signalled to terminate, the token cannot be refreshed? FWIW, this happens fairly infrequently. My goal is to understand why the token cannot be refreshed when using

Alternatively, could this be an intermittent token-refresh issue and less an issue with pod termination? If so, is there any verbose logging and/or debugging we could enable to track this down?
The Go Connector uses a refresh-ahead cache internally that refreshes the client certificate every 56 minutes by performing two SQL Admin API calls. If the token that authorizes the client to make those calls expires, the calls fail, and the client certificate (and the OAuth2 token embedded within it) eventually expires, causing the authentication errors.
Correct. This is why we recommend using a static token source for one-off operations and otherwise recommend application default credentials with workload identity as the default. With Workload Identity, the Go Connector's underlying auth library can refresh the OAuth2 token for you automatically.
Could you show me how you're configuring your token source? In some cases (like with a static token source), I've seen the auth library fail to refresh the token, which then eventually expires. I assume that's what's happening here, but I need to see how you're configuring the token source to say for certain.
You could enable debug logging -- that will report on the refresh operations, and we'll be able to see for certain whether the hypothesis about the token expiring is in fact true.
For debug logging, use
I enabled debug logging, so hopefully this helps us track down specific errors for the token refresh.
FWIW, just today I noticed this happen in a pod that was not terminated and had an uptime of many hours (>4h). It appears the token refresh did succeed, since traffic continued to be served by that pod (and app), so it's definitely intermittent, but I'm a bit surprised. Here's an example of what this looks like under the hood (most of this is adapted from the examples and tests):

```go
options := []cloudsqlconn.Option{
	cloudsqlconn.WithIAMAuthN(),
	cloudsqlconn.WithContextDebugLogger(...),
}

sqlTS, err := impersonate.CredentialsTokenSource(
	ctx,
	impersonate.CredentialsConfig{
		TargetPrincipal: config.Impersonate,
		Scopes:          []string{"https://www.googleapis.com/auth/sqlservice.admin"},
	},
)
if err != nil {
	return nil, nil, err
}

loginTS, err := impersonate.CredentialsTokenSource(
	ctx,
	impersonate.CredentialsConfig{
		TargetPrincipal: config.Impersonate,
		Scopes:          []string{"https://www.googleapis.com/auth/sqlservice.login"},
	},
)
if err != nil {
	return nil, nil, err
}

options = append(options, cloudsqlconn.WithIAMAuthNTokenSources(sqlTS, loginTS))

dialer, err := cloudsqlconn.NewDialer(ctx, options...)
if err != nil {
	return nil, nil, err
}

pgconnDialFunc := func(ctx context.Context, network, address string) (net.Conn, error) {
	return dialer.Dial(ctx, config.CloudSQLInstance, cloudsqlconn.WithPrivateIP())
}
```
Your setup looks perfectly correct.
Same here. It can happen that the backend call to verify the IAM principal fails, but it should be uncommon. Debug logging will at least confirm whether the background refresh is working as intended. If it is, then we'll need to get the backend team to look at this. If you have a support contract, you might open a case and reference this thread.
Minor update, it does appear to be working as intended. We'll continue to monitor this. But I think we should keep this issue open because I suspect this will happen again and hopefully we'll get more details on the underlying error. Thank you for your support.
We got a few alerts last night, and of interest was this in the cloudsql_database logs for the cloudsqliamserviceaccount.
And there was another one around the time of the first error at
So the times do match up, and I didn't see those errors otherwise. It appears to be an intermittent issue that happens (granted rarely) across all our clusters. That pod is still running and serving traffic, so my initial observation in previous comments that this issue may be related to pod termination was incorrect. Here's a bit more logging we got.
Thanks @mfridman. This at least shows us the background refresh isn't doing anything wrong. Based on past conversations with the backend team, I understand some small percentage of IAM authentication calls may fail. Do you have a ballpark estimate for what you're seeing?
I'd categorize it as very low, but I don't have a ballpark figure. Depending on the error, is it worth retrying the operation? Maybe a few retries (say 3) within a short duration (say 2s) would make it slightly more resilient against backend failures? Given the choice between an error and additional latency, I'd pick the latter. This could also be made opt-in so the caller gets to choose.
Right now the Go Connector doesn't understand the database protocol and just returns a connected socket to the driver. If you're using pgx directly, I wonder if you could retry calls to pool.Acquire to hide these auth errors.
Question
Periodically, we see the following error:
This happens in background jobs that are in the process of being shut down. Do you have a suggestion on how to avoid this situation in GKE?
We've never experienced this specific issue when using passwords, so I presume there's some interplay between the app, its environment (GKE), and how IAM is handled by the pods when it receives a shutdown signal.
We're setting up the token sources as described in WithIAMAuthNTokenSources, and then cloudsqlconn.WithPrivateIP() along with the instance's connection name.

Code
No response
Additional Details
cloud.google.com/go/cloudsqlconn v1.9.0