
Periodic Cloud SQL IAM service account authentication failed errors #800

Open
mfridman opened this issue May 13, 2024 · 11 comments


mfridman commented May 13, 2024

Question

Periodically, we see the following error:

server error (FATAL: Cloud SQL IAM service account authentication failed for user "<some_user>@<some_account>.iam" (SQLSTATE 28000))

This happens in background jobs that are in the process of being shut down. Do you have a suggestion on how to avoid this situation in GKE?

We've never experienced this specific issue when using passwords, so I presume there's some interplay between the app, its environment (GKE), and how IAM is handled by the pods when they receive a shutdown signal.

We're setting up the token sources as described in WithIAMAuthNTokenSources, and dialing with cloudsqlconn.WithPrivateIP() along with the instance's connection name.

Code

No response

Additional Details

cloud.google.com/go/cloudsqlconn v1.9.0

mfridman added the type: question label May 13, 2024

enocom commented May 14, 2024

A couple of questions:

  • How long do your background jobs typically run for?
  • Why are your jobs connecting to the database during shutdown?

As for suggestions:

When using WithIAMAuthNTokenSources, the Go Connector won't be able to refresh the token, so once it expires I would expect you'd see errors like the one above. It's best to use Workload Identity.


mfridman commented May 15, 2024

How long do your background jobs typically run for?

They are periodic jobs that run every 10 minutes, for ~2-3 seconds each.

Why are your jobs connecting to the database during shutdown?

Upon receiving a termination signal, a pod usually races to shut down and complete any remaining work. Even though we handle the context gracefully, there's always a chance a new job might start just before the server is signalled to shut down. We won't initiate any new jobs, but some may be on their "last run" and require a database connection.

If it just so happens that the token is expired (and a refresh is necessary), why would this fail?

When using WithIAMAuthNTokenSources, the Go Connector won't be able to refresh the token, so once it expires I would expect you'd see errors like the one above.

I think this is the root of the problem; I presume that when a pod is signalled to terminate, the token cannot be refreshed?

Fwiw this happens fairly infrequently. My goal is to understand why the token cannot be refreshed when using WithIAMAuthNTokenSources. Also, is the token cached for 1 hour?

Alternatively, could this be an intermittent token refresh issue and less an issue with pod termination? If so, is there any verbose logging and/or debug we could enable to track this down?


enocom commented May 15, 2024

If it just so happens that the token is expired (and a refresh is necessary), why would this fail?

The Go Connector internally uses a refresh-ahead cache that refreshes the client certificate every 56 minutes by performing two SQL Admin API calls. If the token that authorizes the client to make those calls expires, the calls fail and the client certificate (and the OAuth2 token embedded in it) eventually expires, causing the authentication errors.

I think this is the root of the problem; I presume that when a pod is signalled to terminate, the token cannot be refreshed?

Correct. This is why we recommend using a static token source for one-off operations and otherwise recommend application default credentials with workload identity as the default. With Workload Identity, the Go Connector's underlying auth library can refresh the OAuth2 token for you automatically.
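
For reference, a minimal sketch of that default setup (ADC via Workload Identity on GKE); the instance connection name below is a placeholder, and no explicit token sources are passed, so the auth library refreshes credentials on its own:

package example

import (
	"context"
	"net"

	"cloud.google.com/go/cloudsqlconn"
)

// newPrivateIPDialFunc builds a dial func backed by Application Default
// Credentials (Workload Identity on GKE). No token sources are supplied,
// so the connector's auth library refreshes credentials automatically.
func newPrivateIPDialFunc(ctx context.Context) (func(context.Context, string, string) (net.Conn, error), error) {
	d, err := cloudsqlconn.NewDialer(ctx, cloudsqlconn.WithIAMAuthN())
	if err != nil {
		return nil, err
	}
	return func(ctx context.Context, _, _ string) (net.Conn, error) {
		return d.Dial(ctx, "my-project:my-region:my-instance", cloudsqlconn.WithPrivateIP())
	}, nil
}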

Fwiw this happens fairly infrequently. My goal is to understand why the token cannot be refreshed when using WithIAMAuthNTokenSources. Also, is the token cached for 1 hour?

Could you show me how you're configuring your token source? In some cases (like with a static token source), I've seen the auth library not refresh the token, so it eventually expires. I assume that's what's happening here, but I need to see how you're configuring the token source to say for certain.

Alternatively, could this be an intermittent token refresh issue and less an issue with pod termination? If so, is there any verbose logging and/or debug we could enable to track this down?

You could enable debug logging -- that will report on the refresh operations and we'll be able to see for certain if the hypothesis about the token expiring is in fact true.


enocom commented May 15, 2024

For debug logging, use cloudsqlconn.WithContextDebugLogger.
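
A rough sketch of wiring that up, assuming the connector's debug.ContextLogger interface is just Debugf(ctx, format, args...); the slog adapter here is only an illustration:

package example

import (
	"context"
	"fmt"
	"log/slog"

	"cloud.google.com/go/cloudsqlconn"
)

// connDebugLogger adapts slog to the connector's context debug logger
// interface (assumed to be Debugf(ctx, format, args...)).
type connDebugLogger struct{ l *slog.Logger }

func (c connDebugLogger) Debugf(ctx context.Context, format string, args ...interface{}) {
	c.l.DebugContext(ctx, fmt.Sprintf(format, args...))
}

func newDebugDialer(ctx context.Context) (*cloudsqlconn.Dialer, error) {
	return cloudsqlconn.NewDialer(ctx,
		cloudsqlconn.WithIAMAuthN(),
		cloudsqlconn.WithContextDebugLogger(connDebugLogger{l: slog.Default()}),
	)
}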

enocom assigned enocom and unassigned hessjcg May 15, 2024
@mfridman
Author

I enabled debug logging, so hopefully this helps us track down specific errors for the token refresh.

I've seen the auth library not refresh the token, so it eventually expires. I assume that's what's happening here, but I need to see how you're configuring the token source to say for certain.

Fwiw, just today I noticed this happen in a pod that was not terminated and had an uptime of many hours (>4h). The token refresh appears to have succeeded, since traffic continued to be served by that pod (and app), so it's definitely intermittent, but I'm a bit surprised.

Here's an example of what this looks like under the hood (most of this is adapted from the examples and tests):

options := []cloudsqlconn.Option{
	cloudsqlconn.WithIAMAuthN(),
	cloudsqlconn.WithContextDebugLogger(...),
}

// Token source for the SQL Admin API calls (certificate refresh).
sqlTS, err := impersonate.CredentialsTokenSource(
	ctx,
	impersonate.CredentialsConfig{
		TargetPrincipal: config.Impersonate,
		Scopes:          []string{"https://www.googleapis.com/auth/sqlservice.admin"},
	},
)
if err != nil {
	return nil, nil, err
}
// Token source for IAM database login.
loginTS, err := impersonate.CredentialsTokenSource(
	ctx,
	impersonate.CredentialsConfig{
		TargetPrincipal: config.Impersonate,
		Scopes:          []string{"https://www.googleapis.com/auth/sqlservice.login"},
	},
)
if err != nil {
	return nil, nil, err
}
options = append(options, cloudsqlconn.WithIAMAuthNTokenSources(sqlTS, loginTS))

dialer, err := cloudsqlconn.NewDialer(ctx, options...)
if err != nil {
	return nil, nil, err
}
// Dial func handed to pgx/pgconn; dials over the instance's private IP.
pgconnDialFunc := func(ctx context.Context, network, address string) (net.Conn, error) {
	return dialer.Dial(ctx, config.CloudSQLInstance, cloudsqlconn.WithPrivateIP())
}
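
For completeness, the dial func then gets plugged into the pool config, roughly like this (a sketch assuming pgx v5; the DSN is a placeholder):

// Sketch (pgx v5): attach the Cloud SQL dial func to the pool config.
poolCfg, err := pgxpool.ParseConfig("user=some_user@some_account.iam dbname=app")
if err != nil {
	return nil, nil, err
}
poolCfg.ConnConfig.DialFunc = pgconnDialFunc
pool, err := pgxpool.NewWithConfig(ctx, poolCfg)
if err != nil {
	return nil, nil, err
}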


enocom commented May 15, 2024

Your setup looks perfectly correct.

so it's definitely intermittent but I'm a bit surprised.

Same here. It can happen that the backend call to verify the IAM principal fails, but it should be uncommon.

Debug logging will at least confirm whether the background refresh is working as intended. If it is, then we'll need to get the backend team to look at this. If you have a support contract, you might open a case and reference this thread.

@mfridman
Author

Minor update: it does appear to be working as intended. We'll continue to monitor this, but I think we should keep this issue open, because I suspect this will happen again and hopefully we'll get more details on the underlying error.

Thank you for your support.

Connection info refresh operation scheduled at 2024-05-16T13:20:46Z (now + 56m0s)
Connection info refresh operation complete
Connection info refresh operation started
Connection info added to cache
Connection info refresh operation scheduled at 2024-05-16T13:20:25Z (now + 56m0s)
Connection info refresh operation complete
Connection info refresh operation started
Connection info added to cache


mfridman commented May 21, 2024

We got a few alerts last night, and of interest was this entry in the cloudsql_database logs for the cloudsqliamserviceaccount.

2024-05-21T01:36:56.719750Z DETAIL: Credentials authentication failed\nConnection matched pg_hba.conf line 26 "...

And there was another one around the time of the first error at

2024-05-21T01:31:56.771516Z DETAIL: Credentials authentication failed\nConnection matched pg_hba.conf line 26 "...

So the times do match up, and I didn't see those errors otherwise.


It appears to be an intermittent issue that happens (granted, rarely) across all our clusters. That pod is still running and serving traffic, so my initial observation in previous comments that this issue may be related to pod termination was incorrect.

Here's a bit more logging we got.

Fail #1
server error (FATAL: Cloud SQL IAM service account authentication failed for user "some_user@some_account.iam" (SQLSTATE 28000))


2024-05-21T01:31:51.637573838Z [spanId: b6ada0833320d66f] Now = 2024-05-21T01:31:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:31:51.646609058Z [spanId: b6ada0833320d66f] Dialing <ip>:<port>
2024-05-21T01:31:51.646604178Z [spanId: b6ada0833320d66f] Cert is valid = true
2024-05-21T01:31:51.646583547Z [spanId: b6ada0833320d66f] Now = 2024-05-21T01:31:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:31:51.637610658Z [spanId: b6ada0833320d66f] Dialing <ip>:<port>
2024-05-21T01:31:51.637605138Z [spanId: b6ada0833320d66f] Cert is valid = true

No errors
2024-05-21T01:32:01.677066925Z [spanId: 1c895344a9319de7] Dialing <ip>:<port>
2024-05-21T01:32:01.677058585Z [spanId: 1c895344a9319de7] Cert is valid = true
2024-05-21T01:32:01.677033825Z [spanId: 1c895344a9319de7] Now = 2024-05-21T01:32:01Z, Current cert expiration = 2024-05-21T02:11:49Z
2024-05-21T01:32:01.668646365Z [spanId: 1c895344a9319de7] Dialing <ip>:<port>
2024-05-21T01:32:01.668640315Z [spanId: 1c895344a9319de7] Cert is valid = true
2024-05-21T01:32:01.668582635Z [spanId: 1c895344a9319de7] Now = 2024-05-21T01:32:01Z, Current cert expiration = 2024-05-21T02:11:49Z
2024-05-21T01:32:01.594015583Z [spanId: 7e32753d96ce643a] Dialing <ip>:<port>
2024-05-21T01:32:01.594009073Z [spanId: 7e32753d96ce643a] Cert is valid = true
2024-05-21T01:32:01.593984683Z [spanId: 7e32753d96ce643a] Now = 2024-05-21T01:32:01Z, Current cert expiration = 2024-05-21T02:11:49Z
2024-05-21T01:32:01.587584384Z [spanId: 7e32753d96ce643a] Dialing <ip>:<port>
2024-05-21T01:32:01.587578124Z [spanId: 7e32753d96ce643a] Cert is valid = true
2024-05-21T01:32:01.587530024Z [spanId: 7e32753d96ce643a] Now = 2024-05-21T01:32:01Z, Current cert expiration = 2024-05-21T02:11:49Z

Fail #2
server error (FATAL: Cloud SQL IAM service account authentication failed for user "some_user@some_account.iam" (SQLSTATE 28000))

2024-05-21T01:36:51.648763983Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.648753223Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.648736833Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.648265153Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.648259623Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.648233583Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.646296843Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.646291933Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.646270343Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.637149304Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.637144773Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.637136533Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.637077044Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.637061024Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.636883953Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.636756704Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.636751653Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.636739744Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z


enocom commented May 21, 2024

Thanks @mfridman. This at least shows us the background refresh isn't doing anything wrong.

Based on past conversations with the backend team, I understand some small percentage of IAM authentication calls may fail. Do you have a ballpark estimate for what you're seeing?


mfridman commented May 21, 2024

I'd categorize it as very low, but I don't have a ballpark figure.

Depending on the error, is it worth retrying the operation? Maybe a few retries (say 3) within a short duration (say 2s) would make it slightly more resilient against backend failures. Given the choice between an error and additional latency, I'd pick the latter. It could also be opt-in so the caller gets to choose.


enocom commented May 23, 2024

Right now the Go Connector doesn't understand the database protocol and just returns a connected socket to the driver. If you're using pgx directly, I wonder if you could retry calls to pool.Acquire to hide these auth errors.
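
A rough, untested sketch of what that could look like on the application side, assuming pgx v5 and that the server's FATAL surfaces as a *pgconn.PgError that errors.As can unwrap (the retry count and backoff are arbitrary):

package example

import (
	"context"
	"errors"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
	"github.com/jackc/pgx/v5/pgxpool"
)

// acquireWithRetry retries pool.Acquire a few times when the backend reports
// an authentication failure (SQLSTATE 28000), e.g. the intermittent
// "Cloud SQL IAM service account authentication failed" error.
func acquireWithRetry(ctx context.Context, pool *pgxpool.Pool) (*pgxpool.Conn, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		conn, err := pool.Acquire(ctx)
		if err == nil {
			return conn, nil
		}
		var pgErr *pgconn.PgError
		if !errors.As(err, &pgErr) || pgErr.Code != "28000" {
			return nil, err // not an auth failure; don't retry
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(500 * time.Millisecond):
		}
	}
	return nil, lastErr
}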

enocom added the priority: p2 and type: cleanup labels May 29, 2024