
Periodic Cloud SQL IAM service account authentication failed errors #800

Open
mfridman opened this issue May 13, 2024 · 11 comments


mfridman commented May 13, 2024

Question

Periodically, we see the following error:

server error (FATAL: Cloud SQL IAM service account authentication failed for user "<some_user>@<some_account>.iam" (SQLSTATE 28000))

This happens in background jobs that are in the process of being shut down. Do you have a suggestion on how to avoid this situation in GKE?

We've never experienced this specific issue when using passwords, so I presume there's some interplay between the app, its environment (GKE), and how IAM is handled by the pods when they receive a shutdown signal.

We're setting up the token sources as described in WithIAMAuthNTokenSources, and dialing with cloudsqlconn.WithPrivateIP() along with the instance's connection name.

Code

No response

Additional Details

cloud.google.com/go/cloudsqlconn v1.9.0

mfridman added the type: question label May 13, 2024

enocom commented May 14, 2024

A couple of questions:

  • How long do your background jobs typically run for?
  • Why are your jobs connecting to the database during shutdown?

As for suggestions:

When using WithIAMAuthNTokenSources, the Go Connector won't be able to refresh the token, so once it expires I would expect you'd see errors like the one above. It's best to use Workload Identity.


mfridman commented May 15, 2024

How long do your background jobs typically run for?

They are periodic jobs that run every 10 minutes, for ~2-3 seconds each.

Why are your jobs connecting to the database during shutdown?

Upon receiving a termination signal, a pod usually races to shut down and complete any remaining work. Even though we handle the context gracefully, there's always a chance a new job might start just before the server is signalled to shut down. We won't initiate any new jobs, but some may be on their "last run" and require a database connection.

If it just so happens that the token is expired (and a refresh is necessary), why would this fail?

When using WithIAMAuthNTokenSources, the Go Connector won't be able to refresh the token, so once it expires I would expect you'd see errors like the one above.

I think this is the root of the problem; I presume that when a pod is signalled to terminate, the token cannot be refreshed?

Fwiw this happens fairly infrequently. My goal is to understand why the token cannot be refreshed when using WithIAMAuthNTokenSources. Also, is the token cached for 1 hour?

Alternatively, could this be an intermittent token refresh issue and less an issue with pod termination? If so, is there any verbose logging and/or debug we could enable to track this down?


enocom commented May 15, 2024

If it just so happens that the token is expired (and a refresh is necessary), why would this fail?

The Go Connector internally uses a refresh-ahead cache that refreshes the client certificate every 56 minutes by performing two SQL Admin API calls. If the token that authorizes the client to make those calls expires, the calls fail and the client certificate (and the OAuth2 token embedded in it) eventually expires, causing the authentication errors.

I think this is the root of the problem; I presume that when a pod is signalled to terminate, the token cannot be refreshed?

Correct. This is why we recommend using a static token source for one-off operations and otherwise recommend application default credentials with workload identity as the default. With Workload Identity, the Go Connector's underlying auth library can refresh the OAuth2 token for you automatically.
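
For reference, a minimal sketch of that default setup (ADC via Workload Identity on GKE); the instance connection name below is a placeholder, and no explicit token sources are passed, so the auth library refreshes credentials on its own:

package example

import (
	"context"
	"net"

	"cloud.google.com/go/cloudsqlconn"
)

// newPrivateIPDialFunc builds a dial func backed by Application Default
// Credentials (Workload Identity on GKE). No token sources are supplied,
// so the connector's auth library refreshes credentials automatically.
func newPrivateIPDialFunc(ctx context.Context) (func(context.Context, string, string) (net.Conn, error), error) {
	d, err := cloudsqlconn.NewDialer(ctx, cloudsqlconn.WithIAMAuthN())
	if err != nil {
		return nil, err
	}
	return func(ctx context.Context, _, _ string) (net.Conn, error) {
		return d.Dial(ctx, "my-project:my-region:my-instance", cloudsqlconn.WithPrivateIP())
	}, nil
}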

Fwiw this happens fairly infrequently. My goal is to understand why the token cannot be refreshed when using WithIAMAuthNTokenSources. Also, is the token cached for 1 hour?

Could you show me how you're configuring your token source? In some cases (like with a static token source), I've seen the auth library not refresh the token, so it eventually expires. I assume that's what's happening here, but I need to see how you're configuring the token source to say for certain.

Alternatively, could this be an intermittent token refresh issue and less an issue with pod termination? If so, is there any verbose logging and/or debug we could enable to track this down?

You could enable debug logging -- that will report on the refresh operations and we'll be able to see for certain if the hypothesis about the token expiring is in fact true.


enocom commented May 15, 2024

For debug logging, use cloudsqlconn.WithContextDebugLogger.
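
A rough sketch of wiring that up, assuming the connector's debug.ContextLogger interface is just Debugf(ctx, format, args...); the slog adapter here is only an illustration:

package example

import (
	"context"
	"fmt"
	"log/slog"

	"cloud.google.com/go/cloudsqlconn"
)

// connDebugLogger adapts slog to the connector's context debug logger
// interface (assumed to be Debugf(ctx, format, args...)).
type connDebugLogger struct{ l *slog.Logger }

func (c connDebugLogger) Debugf(ctx context.Context, format string, args ...interface{}) {
	c.l.DebugContext(ctx, fmt.Sprintf(format, args...))
}

func newDebugDialer(ctx context.Context) (*cloudsqlconn.Dialer, error) {
	return cloudsqlconn.NewDialer(ctx,
		cloudsqlconn.WithIAMAuthN(),
		cloudsqlconn.WithContextDebugLogger(connDebugLogger{l: slog.Default()}),
	)
}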

enocom assigned enocom and unassigned hessjcg May 15, 2024
@mfridman
Author

I enabled debug logging, so hopefully this helps us track down specific errors for the token refresh.

I've seen the auth library not refresh the token, so it eventually expires. I assume that's what's happening here, but I need to see how you're configuring the token source to say for certain.

Fwiw, just today I noticed this happen in a pod that was not terminated and had an uptime of many hours (>4h). The token refresh appears to have succeeded, since traffic continued to be served by that pod (and app), so it's definitely intermittent, but I'm a bit surprised.

Here's an example of what this looks like under the hood (most of this is adapted from the examples and tests):

options := []cloudsqlconn.Option{
	cloudsqlconn.WithIAMAuthN(),
	cloudsqlconn.WithContextDebugLogger(...),
}

// Token source for the SQL Admin API calls (certificate refresh).
sqlTS, err := impersonate.CredentialsTokenSource(
	ctx,
	impersonate.CredentialsConfig{
		TargetPrincipal: config.Impersonate,
		Scopes:          []string{"https://www.googleapis.com/auth/sqlservice.admin"},
	},
)
if err != nil {
	return nil, nil, err
}
// Token source for IAM database login.
loginTS, err := impersonate.CredentialsTokenSource(
	ctx,
	impersonate.CredentialsConfig{
		TargetPrincipal: config.Impersonate,
		Scopes:          []string{"https://www.googleapis.com/auth/sqlservice.login"},
	},
)
if err != nil {
	return nil, nil, err
}
options = append(options, cloudsqlconn.WithIAMAuthNTokenSources(sqlTS, loginTS))

dialer, err := cloudsqlconn.NewDialer(ctx, options...)
if err != nil {
	return nil, nil, err
}
// Dial func handed to pgx/pgconn; dials over the instance's private IP.
pgconnDialFunc := func(ctx context.Context, network, address string) (net.Conn, error) {
	return dialer.Dial(ctx, config.CloudSQLInstance, cloudsqlconn.WithPrivateIP())
}
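
For completeness, the dial func then gets plugged into the pool config, roughly like this (a sketch assuming pgx v5; the DSN is a placeholder):

// Sketch (pgx v5): attach the Cloud SQL dial func to the pool config.
poolCfg, err := pgxpool.ParseConfig("user=some_user@some_account.iam dbname=app")
if err != nil {
	return nil, nil, err
}
poolCfg.ConnConfig.DialFunc = pgconnDialFunc
pool, err := pgxpool.NewWithConfig(ctx, poolCfg)
if err != nil {
	return nil, nil, err
}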


enocom commented May 15, 2024

Your setup looks perfectly correct.

so it's definitely intermittent but I'm a bit surprised.

Same here. It can happen that the backend call to verify the IAM principal fails, but it should be uncommon.

Debug logging will at least confirm whether the background refresh is working as intended. If it is, then we'll need to get the backend team to look at this. If you have a support contract, you might open a case and reference this thread.

@mfridman
Author

Minor update: it does appear to be working as intended. We'll continue to monitor this, but I think we should keep this issue open, because I suspect this will happen again and hopefully we'll get more details on the underlying error.

Thank you for your support.

Connection info refresh operation scheduled at 2024-05-16T13:20:46Z (now + 56m0s)
Connection info refresh operation complete
Connection info refresh operation started
Connection info added to cache
Connection info refresh operation scheduled at 2024-05-16T13:20:25Z (now + 56m0s)
Connection info refresh operation complete
Connection info refresh operation started
Connection info added to cache


mfridman commented May 21, 2024

We got a few alerts last night, and of interest was this entry in the cloudsql_database logs for the cloudsqliamserviceaccount.

2024-05-21T01:36:56.719750Z DETAIL: Credentials authentication failed\nConnection matched pg_hba.conf line 26 "...

And there was another one around the time of the first error at

2024-05-21T01:31:56.771516Z DETAIL: Credentials authentication failed\nConnection matched pg_hba.conf line 26 "...

So the times do match up, and I didn't see those errors otherwise.


It appears to be an intermittent issue that happens (granted, rarely) across all our clusters. That pod is still running and serving traffic, so my initial observation in previous comments that this issue may be related to pod termination was incorrect.

Here's a bit more logging we got.

Fail #1
server error (FATAL: Cloud SQL IAM service account authentication failed for user "some_user@some_account.iam" (SQLSTATE 28000))


2024-05-21T01:31:51.637573838Z [spanId: b6ada0833320d66f] Now = 2024-05-21T01:31:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:31:51.646609058Z [spanId: b6ada0833320d66f] Dialing <ip>:<port>
2024-05-21T01:31:51.646604178Z [spanId: b6ada0833320d66f] Cert is valid = true
2024-05-21T01:31:51.646583547Z [spanId: b6ada0833320d66f] Now = 2024-05-21T01:31:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:31:51.637610658Z [spanId: b6ada0833320d66f] Dialing <ip>:<port>
2024-05-21T01:31:51.637605138Z [spanId: b6ada0833320d66f] Cert is valid = true

No errors
2024-05-21T01:32:01.677066925Z [spanId: 1c895344a9319de7] Dialing <ip>:<port>
2024-05-21T01:32:01.677058585Z [spanId: 1c895344a9319de7] Cert is valid = true
2024-05-21T01:32:01.677033825Z [spanId: 1c895344a9319de7] Now = 2024-05-21T01:32:01Z, Current cert expiration = 2024-05-21T02:11:49Z
2024-05-21T01:32:01.668646365Z [spanId: 1c895344a9319de7] Dialing <ip>:<port>
2024-05-21T01:32:01.668640315Z [spanId: 1c895344a9319de7] Cert is valid = true
2024-05-21T01:32:01.668582635Z [spanId: 1c895344a9319de7] Now = 2024-05-21T01:32:01Z, Current cert expiration = 2024-05-21T02:11:49Z
2024-05-21T01:32:01.594015583Z [spanId: 7e32753d96ce643a] Dialing <ip>:<port>
2024-05-21T01:32:01.594009073Z [spanId: 7e32753d96ce643a] Cert is valid = true
2024-05-21T01:32:01.593984683Z [spanId: 7e32753d96ce643a] Now = 2024-05-21T01:32:01Z, Current cert expiration = 2024-05-21T02:11:49Z
2024-05-21T01:32:01.587584384Z [spanId: 7e32753d96ce643a] Dialing <ip>:<port>
2024-05-21T01:32:01.587578124Z [spanId: 7e32753d96ce643a] Cert is valid = true
2024-05-21T01:32:01.587530024Z [spanId: 7e32753d96ce643a] Now = 2024-05-21T01:32:01Z, Current cert expiration = 2024-05-21T02:11:49Z

Fail #2
server error (FATAL: Cloud SQL IAM service account authentication failed for user "some_user@some_account.iam" (SQLSTATE 28000))

2024-05-21T01:36:51.648763983Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.648753223Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.648736833Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.648265153Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.648259623Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.648233583Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.646296843Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.646291933Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.646270343Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.637149304Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.637144773Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.637136533Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.637077044Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.637061024Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.636883953Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z
2024-05-21T01:36:51.636756704Z [spanId: 08d132b6aea91544] Dialing <ip>:<port>
2024-05-21T01:36:51.636751653Z [spanId: 08d132b6aea91544] Cert is valid = true
2024-05-21T01:36:51.636739744Z [spanId: 08d132b6aea91544] Now = 2024-05-21T01:36:51Z, Current cert expiration = 2024-05-21T02:10:49Z


enocom commented May 21, 2024

Thanks @mfridman. This at least shows us the background refresh isn't doing anything wrong.

Based on past conversations with the backend team, I understand some small percentage of IAM authentication calls may fail. Do you have a ballpark estimate for what you're seeing?


mfridman commented May 21, 2024

I'd categorize it as very low, but I don't have a ballpark figure.

Depending on the error, is it worth retrying the operation? Maybe a few retries (say 3) within a short duration (say 2s) would make it slightly more resilient against backend failures. Given the choice between an error and additional latency, I'd pick the latter. It could also be opt-in so the caller gets to choose.


enocom commented May 23, 2024

Right now the Go Connector doesn't understand the database protocol and just returns a connected socket to the driver. If you're using pgx directly, I wonder if you could retry calls to pool.Acquire to hide these auth errors.
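
A rough, untested sketch of what that could look like on the application side, assuming pgx v5 and that the server's FATAL surfaces as a *pgconn.PgError that errors.As can unwrap (the retry count and backoff are arbitrary):

package example

import (
	"context"
	"errors"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
	"github.com/jackc/pgx/v5/pgxpool"
)

// acquireWithRetry retries pool.Acquire a few times when the backend reports
// an authentication failure (SQLSTATE 28000), e.g. the intermittent
// "Cloud SQL IAM service account authentication failed" error.
func acquireWithRetry(ctx context.Context, pool *pgxpool.Pool) (*pgxpool.Conn, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		conn, err := pool.Acquire(ctx)
		if err == nil {
			return conn, nil
		}
		var pgErr *pgconn.PgError
		if !errors.As(err, &pgErr) || pgErr.Code != "28000" {
			return nil, err // not an auth failure; don't retry
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(500 * time.Millisecond):
		}
	}
	return nil, lastErr
}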

enocom added the priority: p2 and type: cleanup labels May 29, 2024