
Cloud SQL IAM service account authentication failed for user #1174

Closed
enocom opened this issue Feb 23, 2023 · 42 comments · Fixed by #1313 or #1312
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@enocom (Member) commented Feb 23, 2023

Bug Description

An otherwise valid configuration will occasionally fail with Cloud SQL IAM service account authentication failed for user.

The backend will log: Failed to validate access token (as opposed to access token expired).

While customers have seen this error in Cloud Run (possibly suggesting a CPU-throttling issue with the background refresh), it also appears on GKE.

Updating to the latest version has not resolved these occasional errors.

Example code (or command)

No response

Stacktrace

No response

Steps to reproduce?

  1. Deploy an app that logs in with Auto IAM AuthN
  2. Wait a while
  3. Observe occasional failures

Environment

  1. OS type and version: Linux Container
  2. Java SDK version: ?
  3. Cloud SQL Java Socket Factory version: v1.10.0

Additional Details

No response

@enocom enocom added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Feb 23, 2023
@enocom enocom assigned enocom and unassigned shubha-rajan Feb 23, 2023
@enocom (Member Author) commented Feb 28, 2023

Running a Spring Boot app with Hikari on a GCE instance. Hikari is configured with 5 connections set to refresh every 60s. I've also patched the Java connector to run a refresh cycle every 60s. We'll see if we can force out these IAM authn errors.
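For reference, the experiment above can be approximated with a configuration along these lines. This is a hedged sketch: the property names assume Spring Boot's HikariCP binding, and the project, instance, database, and user values are placeholders, not the actual test setup.

```properties
# Hypothetical application.properties for the experiment described above.
# Project, instance, database, and username are placeholders.
spring.datasource.url=jdbc:postgresql:///mydb?cloudSqlInstance=my-project:us-central1:my-instance&socketFactory=com.google.cloud.sql.postgres.SocketFactory&enableIamAuth=true&sslmode=disable
spring.datasource.username=my-sa@my-project.iam
spring.datasource.hikari.maximum-pool-size=5
# Retire every pooled connection after 60s so connection creation
# (and therefore token fetching) is exercised constantly.
spring.datasource.hikari.max-lifetime=60000
```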

@enocom (Member Author) commented Mar 1, 2023

After running the above experiment for ~24 hours, I don't see any errors. Going to wrap this up in a container and run it on GKE to see if that helps.

@sscheible

We specifically observed the issue only with Cloud Run instances' service accounts, and never with GCE. We're not using GKE in this context, so I can't speak to that.

@michae1T commented Mar 6, 2023

Observed this issue twice since January across three persistent Cloud Run nodes (CPU throttling disabled). Hikari pool size 10.

Feb 10th - v1.8.3 - project A - single node failure
March 3rd - v1.10.0 - project B - single node failure

Nodes recovered on their own after 12 and 17 minutes respectively. The failures were on previously functioning nodes, not newly started ones. No relevant problem indicators in metrics or logs around the time.

We have another service with a similar configuration that has not yet encountered the issue running v1.8.3 with 4 persistent nodes across 2 projects.

Confirmed the problem on an image based on eclipse-temurin:11.0.17_8-jre-focal, yet an image based on eclipse-temurin:11.0.17_8-jre-alpine is unaffected.

service:

java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after (xxxx)ms.
   (...)
Caused by: org.postgresql.util.PSQLException: FATAL: Cloud SQL IAM service account authentication failed for user "xxxx"

postgres:

FATAL:  Cloud SQL IAM service account authentication failed for user "xxxx"

The only theory so far is around the system clock; however, since it reproduces infrequently, we haven't done extra investigation yet.

@enocom (Member Author) commented Mar 6, 2023

Thanks, folks. I'll try to reproduce this on Cloud Run as well.

@vr commented Mar 13, 2023

It is still happening from time to time on our clusters, slightly less with v1.10.0 than with v1.8.x, but still, we have rolled back to classical (non-IAM) authentication on production systems.
We are using a distroless Java 17 Docker image on GKE.
Do we know what the issue might be?

@enocom (Member Author) commented Mar 13, 2023

Current hypothesis is that there's (another) race condition in the underlying auth library -- I'm working today on another reproduction attempt.

@enocom (Member Author) commented Mar 14, 2023

After running my patched connector in GKE for 24 hours, I still don't see any errors.

For reference, I'm running with a eclipse-temurin:17.0.6_10-jre-jammy base image.

@enocom (Member Author) commented Mar 15, 2023

Pushing my debug app to Cloud Run with 1 CPU allocated and 1 minimum instance.

@michae1T

This just reproed again: Cloud Run, persistent, single node, starting 10:45 ET.

@enocom (Member Author) commented Mar 15, 2023

Base image is eclipse-temurin:11.0.17_8-jre-focal?

@michae1T commented Mar 15, 2023

Yes, however in this version we have downgraded to postgres-socket-factory-1.8.3. To be clear, this is the 3rd incident since January and the first since my last message.

5 minutes prior to the exceptions there is a significant increase in logging activity along the lines of:

Mar 15, 2023 2:39:34 PM com.google.cloud.sql.core.CoreSocketFactory connect
INFO: Connecting to Cloud SQL instance [xxxx] via SSL socket.  

About 50 of them in total. A handful of requests complete normally, interspersed with this logging, then all requests begin to fail as the pool is exhausted, I suppose. FWIW, the timestamp printed in the above trace roughly matches the timestamp in the Cloud Run logs (same second).

10:39 ET - socket factory cycling begins + first "Cloud SQL IAM service account authentication failed" PG logs
10:45 ET - "timeouts" begin
11:02 ET - full recovery without intervention

@enocom (Member Author) commented Mar 15, 2023

> 10:45 ET - "timeouts" begin

I assume that means you're seeing this:

java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after (xxxx)ms.
   (...)
Caused by: org.postgresql.util.PSQLException: FATAL: Cloud SQL IAM service account authentication failed for user "xxxx"

Yes?

@michae1T

Yes, correct

@enocom (Member Author) commented Mar 16, 2023

After running the app on Cloud Run for 24 hours, I still don't see this error.

For next steps, I'm going to switch base images to eclipse-temurin:11.0.17_8-jre-focal. Additionally, I'll put the GKE and Cloud Run deployments under constant load to see if this might be related to CPU usage.

@michae1T

FWIW, given what I have seen, CPU load is not necessary to reproduce it, although maybe it will agitate things. 24 hours is also a bit of a short timeframe. Maybe I can give you a container with its business logic stripped out that you can run on a larger pool of nodes?

@enocom (Member Author) commented Mar 16, 2023

Help reproducing would be much appreciated. Some customers have reported seeing this error on a much more frequent interval. For now, it's not clear what correlations there might be.

@sscheible

I can share that we were seeing this issue ~1,000 times a day across ~14 Cloud Run instances before disabling CPU throttling. After disabling CPU throttling, we are still seeing it ~200 times a day. So if you have a hard time reproducing, enable CPU throttling (on the assumption that the product should also reliably manage IAM connections for CPU-throttled Cloud Run instances).

@michae1T

Very interesting that there is an increase correlating with CPU throttling; JVMs tend to struggle with GC under this configuration, leading back to CPU. The number of instances in your case will make it hard to detect, but I am wondering if there is anything interesting in your P99/max for the affected service? Is your CPU/memory usage high in general?

@enocom OK, I'll see what I can do. Maybe you can combine my image with @sscheible's app config.

@enocom (Member Author) commented Mar 17, 2023

Thanks for the additional information, @sscheible and @michae1T. We've known that CPU throttling can cause problems in Cloud Run, which is the motivation for #992. I wonder if GKE deployments are likewise resource constrained.
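For context on the throttling hypothesis: the connector refreshes certificates and tokens on a background thread, while Cloud Run's default CPU throttling only guarantees CPU while a request is being handled. A minimal, stdlib-only Java sketch of that background-refresh pattern (illustrative names, not the connector's actual code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class BackgroundRefresh {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch refreshed = new CountDownLatch(2);
        // Renew the credential well before it expires. If this thread is
        // starved (e.g. the container is CPU-throttled between requests),
        // the cached credential can go stale before the next renewal runs.
        ScheduledFuture<?> task = scheduler.scheduleAtFixedRate(
                refreshed::countDown, 0, 50, TimeUnit.MILLISECONDS);
        boolean ranTwice = refreshed.await(2, TimeUnit.SECONDS);
        task.cancel(true);
        scheduler.shutdown();
        if (!ranTwice) {
            throw new AssertionError("refresh loop never ran");
        }
        System.out.println("refresh loop ran at least twice");
    }
}
```

Under throttling, the equivalent of this scheduled task may simply not run on time, which would be one way for stale credentials to reach the server.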

@vr commented Mar 22, 2023

In my case, this is clearly not resource constrained; it was happening more on applications which get little traffic.

@enocom (Member Author) commented Mar 22, 2023

In that case, I'm going to adjust my approach and restore the original connector and just run that for a few days.

After running the debug app in Cloud Run and GKE for a week or so, I haven't seen a single occurrence of this error.

Meanwhile, if folks could let me know what base container image they're using, that might help.

@michae1T commented Mar 22, 2023

@enocom I attempted to create a repo with a distillation of the problem app image down to its base components, where all the business logic is replaced with SELECT 1.

It should build an image with the same base, jars, JVM params, server config/init, and Cloud SQL connection. You should be able to inject any additional logging as appropriate.

Unfortunately, as highlighted before, the incident is fairly infrequent: you would need a pool of at least 120 nodes under the same circumstances to reproduce it in 24 hours, if you have the budget.

In our experience you should not actually need to query the service to reproduce. To me there is no indication that the size of the connection pool is important, as all the connections eventually die during an incident.

https://github.com/michae1T/cloudsql-pg-iam-fail-repro

@enocom (Member Author) commented Mar 22, 2023

Wow! Thanks @michae1T! I'll try this out and hopefully see the issue.

@michae1T

No problem. Fixed the repo link; hopefully it's helpful.

@sscheible

> In that case, I'm going to adjust my approach and restore the original connector and just run that for a few days.
>
> After running the debug app in Cloud Run and GKE for a week or so, I haven't seen a single occurrence of this error.
>
> Meanwhile, if folks could let me know what base container image they're using, that might help.

We're using amazoncorretto:17-alpine

@vr commented Mar 28, 2023

we are using gcr.io/distroless/java17-debian11

@enocom (Member Author) commented Mar 29, 2023

So, a quick update: after running on Cloud Run with 100 instances for almost a week, I'm not seeing this error.

Instead of continuing to guess, I'm going to make a change to the connector to throw on empty or expired tokens (the two most likely problems). From there we can further isolate this to a problem in the credentials code or possibly in the backend (although I've only heard about this in Java).

@vr commented Mar 30, 2023

Yup, we never had the issue with our cloud-sql-proxy setups, but we have fewer of those than the Java ones.

@sscheible commented Mar 31, 2023

Seeing very promising results with cloud-gcp-dependencies 3.4.7 (up from 3.4.3), which includes cloud-sql-jdbc-socket-factory#1147, in addition to having CPU throttling disabled. After monitoring over the weekend, it seems that this almost eliminates the issue (down from ~200/day to ~2/day, so another two orders of magnitude).

N.B. we are still seeing at least some of these errors with 3.4.7 and throttling enabled; running some more tests to quantify.

@hanfi commented Apr 3, 2023

Hello,

I'm also having this issue. Here is my architecture:

  • cloud run
  • quarkus
  • mysql-socket-factory-connector-j-8 1.10.0 (just updated to 1.11.0 to test whether #1147, "feat: improve reliability of refresh operations", fixes the issue; needs some time to observe)
  • mysql cloudsql 8.0 private instance
  • connection through Serverless VPC access instances

It happens randomly. I found corresponding entries in the Cloud SQL logs:
CloudSQL Instance's IAM access denied for user xxxxx@xxxxx.iam.gserviceaccount.com: Error code :UNAUTHENTICATED Error Message :Failed to validate access token

@enocom (Member Author) commented Apr 3, 2023

Thanks @hanfi

If folks are seeing this bug, it would help to know:

  1. The size of your deployment
  2. The relative rate of these errors (1 out of 100 requests for example)
  3. What base container and application you're using

@hanfi commented Apr 4, 2023

> 1. The size of your deployment

Three Quarkus Cloud Run services (the problem happens on all three), using v1.10.0.

> 2. The relative rate of these errors (1 out of 100 requests for example)

(screenshot omitted)

> 3. What base container and application you're using

registry.access.redhat.com/ubi8/openjdk-17:1.14-8 (the Quarkus Docker image)

@enocom (Member Author) commented Apr 4, 2023

I just merged #1233 which will throw an exception if an auth token is empty or expired (the two leading hypotheses). That will go out in our next release on Tuesday and I hope it will narrow down the issue here.

@enocom (Member Author) commented Apr 11, 2023

v1.11.1 now has a check for an invalid token. I recommend upgrading to see if these occasional errors are caused by an invalid token.
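The shape of that check is roughly the following (an illustrative stdlib-only sketch with made-up names, not the connector's actual code; see #1233 for the real change): fail fast, client-side, when the fetched token is empty or already expired.

```java
import java.time.Instant;

public class TokenCheck {
    // Illustrative stand-in for the kind of validation added in v1.11.1:
    // reject an empty or already-expired token before using it to connect.
    static void validateAccessToken(String token, Instant expiresAt, Instant now) {
        if (token == null || token.isEmpty()) {
            throw new RuntimeException("Access token is empty");
        }
        if (expiresAt != null && !now.isBefore(expiresAt)) {
            throw new RuntimeException("Access token is expired");
        }
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-04-11T00:00:00Z");
        // A non-empty, unexpired token passes silently.
        validateAccessToken("ya29.example-token", now.plusSeconds(3600), now);
        try {
            validateAccessToken("", now.plusSeconds(3600), now);
            throw new AssertionError("expected an empty token to be rejected");
        } catch (RuntimeException expected) {
            System.out.println("empty token rejected");
        }
    }
}
```

The point of the change is diagnostic: an exception thrown here pins the blame on the credentials path rather than the backend.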

@hanfi commented Apr 18, 2023

Thx @enocom, I don't find the error in the logs anymore :D

Thanks a lot, seems fixed for me.

@michae1T

No incident since March 15th on our side (#1174 (comment)), with no code changes. Based on the previous pattern, I would have expected it to reproduce in under a month. I wonder if our issue is not directly tied to the library, even if it may be an aggravating factor. We will release with 1.11.1 now and continue to monitor.

@enocom (Member Author) commented Apr 19, 2023

Thanks for the update folks. I'll leave this open for now.

@incomprendido

The ability to reproduce this may depend on the version of google-auth-library in use; other Maven dependencies could pull in an older version of that library, resulting in a variation of the following issue when credentials.refresh() is called in SqlAdminApiFetcher.java: googleapis/google-auth-library-java#692 (comment).

@enocom you may have fixed this with the RuntimeExceptions thrown in the validateAccessToken method added to SqlAdminApiFetcher.java in version 1.11.1. A RuntimeException should cause the onFailure callback in the performRefresh method of CloudSqlInstance.java to run, which then results in a retry of the refresh action.
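In other words, the new exception turns a silently bad token into a failed refresh attempt that gets retried. A stdlib-only sketch of that fail-then-retry flow (illustrative names; the real connector drives this through future callbacks rather than a loop):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefreshRetry {
    // Simulated refresh: fails twice with a bad token, then succeeds.
    static String refresh(AtomicInteger attempts) {
        if (attempts.incrementAndGet() < 3) {
            throw new RuntimeException("Access token is empty");
        }
        return "fresh-certificate";
    }

    // The onFailure path: a thrown RuntimeException triggers another
    // refresh attempt instead of caching the bad result.
    static String refreshWithRetry(AtomicInteger attempts, int maxTries) {
        RuntimeException last = null;
        for (int i = 0; i < maxTries; i++) {
            try {
                return refresh(attempts);
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        String result = refreshWithRetry(attempts, 5);
        if (!result.equals("fresh-certificate") || attempts.get() != 3) {
            throw new AssertionError("retry flow did not recover as expected");
        }
        System.out.println("recovered after " + attempts.get() + " attempts");
    }
}
```

Without the exception, the same bad token would be cached as if it were valid, which matches the "fixed and concealed" framing in the next comment.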

@enocom (Member Author) commented Apr 25, 2023

That's a good point. I've omitted any logging, and it seems the Java Connector doesn't always let exceptions bubble up, so I might have both fixed and concealed the problem.

@vr commented Jun 2, 2023

Still no issues here; can we consider it safe now?

@enocom (Member Author) commented Jun 2, 2023

Yes, I believe so. I'm leaving this open until I add logging just to make it clear which case we're hitting (bad token, expired token). Otherwise, this is effectively fixed.

7 participants