
Cloud SQL IAM service account authentication failed for user #1174

Closed
enocom opened this issue Feb 23, 2023 · 42 comments · Fixed by #1313 or #1312
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@enocom (Member) commented Feb 23, 2023

Bug Description

An otherwise valid configuration will occasionally fail with Cloud SQL IAM service account authentication failed for user.

The backend will log: Failed to validate access token (as opposed to access token expired).

While customers have seen this error in Cloud Run (possibly suggesting a CPU-throttling issue with the background refresh), it also appears on GKE.

Updating to the latest version has not resolved these occasional errors.

Example code (or command)

No response

Stacktrace

No response

Steps to reproduce?

  1. Deploy an app that logs in with Auto IAM AuthN
  2. Wait a while
  3. Observe occasional failures

Environment

  1. OS type and version: Linux Container
  2. Java SDK version: ?
  3. Cloud SQL Java Socket Factory version: v1.10.0

Additional Details

No response

@enocom enocom added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Feb 23, 2023
@enocom enocom assigned enocom and unassigned shubha-rajan Feb 23, 2023
@enocom (Member Author) commented Feb 28, 2023

Running a Spring Boot app with Hikari on a GCE instance. Hikari is configured with 5 connections set to refresh every 60s. I've also patched the Java connector to run a refresh cycle every 60s. We'll see if we can force out these IAM authn errors.
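For reference, the experiment above can be approximated with a configuration along these lines. This is a hedged sketch: the property names assume Spring Boot's HikariCP binding, and the project, instance, database, and user values are placeholders, not the actual test setup.

```properties
# Hypothetical application.properties for the experiment described above.
# Project, instance, database, and username are placeholders.
spring.datasource.url=jdbc:postgresql:///mydb?cloudSqlInstance=my-project:us-central1:my-instance&socketFactory=com.google.cloud.sql.postgres.SocketFactory&enableIamAuth=true&sslmode=disable
spring.datasource.username=my-sa@my-project.iam
spring.datasource.hikari.maximum-pool-size=5
# Retire every pooled connection after 60s so connection creation
# (and therefore token fetching) is exercised constantly.
spring.datasource.hikari.max-lifetime=60000
```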

@enocom (Member Author) commented Mar 1, 2023

After running the above experiment for ~24 hours, I don't see any errors. Going to wrap this up in a container and run it on GKE to see if that helps.

@sscheible

We specifically observed the issue only with Cloud Run instances' service accounts, and never with GCE. We're not using GKE in this context, so I can't speak to that.

@michae1T commented Mar 6, 2023

Observed this issue twice since January across three persistent Cloud Run nodes (CPU throttling disabled). Hikari pool size 10.

Feb 10th - v1.8.3 - project A - single node failure
March 3rd - v1.10.0 - project B - single node failure

Nodes recovered on their own after 12 and 17 minutes respectively. The failures were on previously functioning nodes, not newly started ones. No relevant problem indicators in metrics or logs around the time.

We have another service with a similar configuration that has not yet encountered the issue running v1.8.3 with 4 persistent nodes across 2 projects.

Confirmed the problem on an image based on eclipse-temurin:11.0.17_8-jre-focal, yet an image based on eclipse-temurin:11.0.17_8-jre-alpine is unaffected.

service:

java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after (xxxx)ms.
   (...)
Caused by: org.postgresql.util.PSQLException: FATAL: Cloud SQL IAM service account authentication failed for user "xxxx"

postgres:

FATAL:  Cloud SQL IAM service account authentication failed for user "xxxx"

The only theory so far is around the system clock; however, since it reproduces infrequently, we haven't done extra investigation yet.

@enocom (Member Author) commented Mar 6, 2023

Thanks, folks. I'll try to reproduce this on Cloud Run as well.

@vr commented Mar 13, 2023

It is still happening from time to time on our clusters, slightly less with v1.10.0 than with v1.8.x, but still, we have rolled back to classical (non-IAM) authentication on production systems.
We are using a distroless Java 17 Docker image on GKE.
Do we know what the issue might be?

@enocom (Member Author) commented Mar 13, 2023

Current hypothesis is that there's (another) race condition in the underlying auth library -- I'm working today on another reproduction attempt.

@enocom (Member Author) commented Mar 14, 2023

After running my patched connector in GKE for 24 hours, I still don't see any errors.

For reference, I'm running with a eclipse-temurin:17.0.6_10-jre-jammy base image.

@enocom (Member Author) commented Mar 15, 2023

Pushing my debug app to Cloud Run with 1 CPU allocated and 1 minimum instance.

@michae1T

This just reproed again: Cloud Run, persistent, single node, starting 10:45 ET.

@enocom (Member Author) commented Mar 15, 2023

Base image is eclipse-temurin:11.0.17_8-jre-focal?

@michae1T commented Mar 15, 2023

Yes, however in this version we have downgraded to postgres-socket-factory-1.8.3. To be clear, this is the 3rd incident since January and the first since my last message.

5 minutes prior to the exceptions there is a significant increase in logging activity along the lines of:

Mar 15, 2023 2:39:34 PM com.google.cloud.sql.core.CoreSocketFactory connect
INFO: Connecting to Cloud SQL instance [xxxx] via SSL socket.  

About 50 of them in total. A handful of requests complete normally, interspersed with this logging, then all requests begin to fail as the pool is exhausted, I suppose. FWIW, the timestamp printed in the above trace roughly matches the timestamp in the Cloud Run logs (same second).

10:39 ET - socket factory cycling begins + first "Cloud SQL IAM service account authentication failed" PG logs
10:45 ET - "timeouts" begin
11:02 ET - full recovery without intervention

@enocom (Member Author) commented Mar 15, 2023

> 10:45 ET - "timeouts" begin

I assume that means you're seeing this:

java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after (xxxx)ms.
   (...)
Caused by: org.postgresql.util.PSQLException: FATAL: Cloud SQL IAM service account authentication failed for user "xxxx"

Yes?

@michae1T

Yes, correct

@enocom (Member Author) commented Mar 16, 2023

After running the app on Cloud Run for 24 hours, I still don't see this error.

For next steps, I'm going to switch base images to eclipse-temurin:11.0.17_8-jre-focal. Additionally, I'll put the GKE and Cloud Run deployments under constant load to see if this might be related to CPU usage.

@michae1T

FWIW, given what I have seen, CPU load is not necessary to reproduce it, although maybe it will agitate things. 24 hours is also a bit of a short timeframe. Maybe I can give you a container with its business logic stripped out that you can run on a larger pool of nodes?

@enocom (Member Author) commented Mar 16, 2023

Help reproducing would be much appreciated. Some customers have reported seeing this error on a much more frequent interval. For now, it's not clear what correlations there might be.

@sscheible

I can share that we were seeing this issue ~1,000 times a day across ~14 Cloud Run instances before disabling CPU throttling. After disabling CPU throttling, we are still seeing it ~200 times a day. So if you have a hard time reproducing, enable CPU throttling (on the assumption that the product should also reliably manage IAM connections for CPU-throttled Cloud Run instances).

@michae1T

Very interesting that there is an increase correlating with CPU throttling; JVMs tend to struggle with GC under this configuration, leading back to CPU. The number of instances in your case will make it hard to detect, but I am wondering if there is anything interesting in your P99/max for the affected service? Is your CPU/memory usage high in general?

@enocom OK, I'll see what I can do. Maybe you can combine my image with @sscheible's app config.

@enocom (Member Author) commented Mar 17, 2023

Thanks for the additional information, @sscheible and @michae1T. We've known that CPU throttling can cause problems in Cloud Run, which is the motivation for #992. I wonder if GKE deployments are likewise resource constrained.
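For context on the throttling hypothesis: the connector refreshes certificates and tokens on a background thread, while Cloud Run's default CPU throttling only guarantees CPU while a request is being handled. A minimal, stdlib-only Java sketch of that background-refresh pattern (illustrative names, not the connector's actual code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class BackgroundRefresh {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch refreshed = new CountDownLatch(2);
        // Renew the credential well before it expires. If this thread is
        // starved (e.g. the container is CPU-throttled between requests),
        // the cached credential can go stale before the next renewal runs.
        ScheduledFuture<?> task = scheduler.scheduleAtFixedRate(
                refreshed::countDown, 0, 50, TimeUnit.MILLISECONDS);
        boolean ranTwice = refreshed.await(2, TimeUnit.SECONDS);
        task.cancel(true);
        scheduler.shutdown();
        if (!ranTwice) {
            throw new AssertionError("refresh loop never ran");
        }
        System.out.println("refresh loop ran at least twice");
    }
}
```

Under throttling, the equivalent of this scheduled task may simply not run on time, which would be one way for stale credentials to reach the server.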

@vr commented Mar 22, 2023

In my case, this is clearly not resource constrained; it was happening more on applications which get little traffic.

@enocom (Member Author) commented Mar 22, 2023

In that case, I'm going to adjust my approach and restore the original connector and just run that for a few days.

After running the debug app in Cloud Run and GKE for a week or so, I haven't seen a single occurrence of this error.

Meanwhile, if folks could let me know what base container image they're using, that might help.

@michae1T commented Mar 22, 2023

@enocom I attempted to create a repo with a distillation of the problem app image down to its base components, where all the business logic is replaced with SELECT 1.

It should build an image with the same base, jars, JVM params, server config/init, and Cloud SQL connection. You should be able to inject any additional logging as appropriate.

Unfortunately, as highlighted before, the incident is fairly infrequent: you would need a pool of at least 120 nodes under the same circumstances to reproduce it in 24 hours, if you have the budget.

In our experience you should not actually need to query the service to reproduce. To me there is no indication that the size of the connection pool is important, as all the connections eventually die during an incident.

https://github.com/michae1T/cloudsql-pg-iam-fail-repro

@enocom (Member Author) commented Mar 22, 2023

Wow! Thanks @michae1T! I'll try this out and hopefully see the issue.

@michae1T

No problem. Fixed the repo link; hopefully it's helpful.

@sscheible

> In that case, I'm going to adjust my approach and restore the original connector and just run that for a few days.
>
> After running the debug app in Cloud Run and GKE for a week or so, I haven't seen a single occurrence of this error.
>
> Meanwhile, if folks could let me know what base container image they're using, that might help.

We're using amazoncorretto:17-alpine

@vr commented Mar 28, 2023

we are using gcr.io/distroless/java17-debian11

@enocom (Member Author) commented Mar 29, 2023

So, a quick update: after running on Cloud Run with 100 instances for almost a week, I'm not seeing this error.

Instead of continuing to guess, I'm going to make a change to the connector to throw on empty or expired tokens (the two most likely problems). From there we can further isolate this to a problem in the credentials code or possibly in the backend (although I've only heard about this in Java).

@vr commented Mar 30, 2023

Yup, we never had the issue with our cloud-sql-proxy setups, but we have fewer of those than the Java ones.

@sscheible commented Mar 31, 2023

Seeing very promising results with cloud-gcp-dependencies 3.4.7 (up from 3.4.3), which includes cloud-sql-jdbc-socket-factory#1147, in addition to having CPU throttling disabled. After monitoring over the weekend, it seems that this almost eliminates the issue (down from ~200/day to ~2/day, so another two orders of magnitude).

N.B. we are still seeing at least some of these errors with 3.4.7 and throttling enabled; running some more tests to quantify.

@hanfi commented Apr 3, 2023

Hello,

I'm also having this issue. Here is my architecture:

  • cloud run
  • quarkus
  • mysql-socket-factory-connector-j-8 1.10.0 (just updated to 1.11.0 to test whether #1147, "feat: improve reliability of refresh operations", fixes the issue; needs some time to observe)
  • mysql cloudsql 8.0 private instance
  • connection through Serverless VPC access instances

It happens randomly. I found corresponding entries in the Cloud SQL logs:
CloudSQL Instance's IAM access denied for user xxxxx@xxxxx.iam.gserviceaccount.com: Error code :UNAUTHENTICATED Error Message :Failed to validate access token

@enocom (Member Author) commented Apr 3, 2023

Thanks @hanfi

If folks are seeing this bug, it would help to know:

  1. The size of your deployment
  2. The relative rate of these errors (1 out of 100 requests for example)
  3. What base container and application you're using

@hanfi commented Apr 4, 2023

> 1. The size of your deployment

Three Quarkus Cloud Run services (the problem happens on all three), using v1.10.0.

> 2. The relative rate of these errors (1 out of 100 requests for example)

(screenshot omitted)

> 3. What base container and application you're using

registry.access.redhat.com/ubi8/openjdk-17:1.14-8 (the Quarkus Docker image)

@enocom (Member Author) commented Apr 4, 2023

I just merged #1233 which will throw an exception if an auth token is empty or expired (the two leading hypotheses). That will go out in our next release on Tuesday and I hope it will narrow down the issue here.

@enocom (Member Author) commented Apr 11, 2023

v1.11.1 now has a check for an invalid token. I recommend upgrading to see if these occasional errors are caused by an invalid token.
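The shape of that check is roughly the following (an illustrative stdlib-only sketch with made-up names, not the connector's actual code; see #1233 for the real change): fail fast, client-side, when the fetched token is empty or already expired.

```java
import java.time.Instant;

public class TokenCheck {
    // Illustrative stand-in for the kind of validation added in v1.11.1:
    // reject an empty or already-expired token before using it to connect.
    static void validateAccessToken(String token, Instant expiresAt, Instant now) {
        if (token == null || token.isEmpty()) {
            throw new RuntimeException("Access token is empty");
        }
        if (expiresAt != null && !now.isBefore(expiresAt)) {
            throw new RuntimeException("Access token is expired");
        }
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-04-11T00:00:00Z");
        // A non-empty, unexpired token passes silently.
        validateAccessToken("ya29.example-token", now.plusSeconds(3600), now);
        try {
            validateAccessToken("", now.plusSeconds(3600), now);
            throw new AssertionError("expected an empty token to be rejected");
        } catch (RuntimeException expected) {
            System.out.println("empty token rejected");
        }
    }
}
```

The point of the change is diagnostic: an exception thrown here pins the blame on the credentials path rather than the backend.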

@hanfi commented Apr 18, 2023

Thx @enocom, I don't find the error in the logs anymore :D

Thanks a lot, seems fixed for me.

@michae1T

No incident since March 15th on our side (#1174 (comment)), with no code changes. Based on the previous pattern, I would have expected it to reproduce in under a month. I wonder if our issue is not directly tied to the library, even if it may be an aggravating factor. We will release with 1.11.1 now and continue to monitor.

@enocom (Member Author) commented Apr 19, 2023

Thanks for the update folks. I'll leave this open for now.

@incomprendido

The ability to reproduce this may depend on the version of google-auth-library in use; other Maven dependencies could pull in an older version of that library, resulting in a variation of the following issue when credentials.refresh() is called in SqlAdminApiFetcher.java: googleapis/google-auth-library-java#692 (comment).

@enocom you may have fixed this with the RuntimeExceptions thrown in the validateAccessToken method added to SqlAdminApiFetcher.java in version 1.11.1. A RuntimeException should cause the onFailure callback in the performRefresh method of CloudSqlInstance.java to run, which then results in a retry of the refresh action.
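In other words, the new exception turns a silently bad token into a failed refresh attempt that gets retried. A stdlib-only sketch of that fail-then-retry flow (illustrative names; the real connector drives this through future callbacks rather than a loop):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefreshRetry {
    // Simulated refresh: fails twice with a bad token, then succeeds.
    static String refresh(AtomicInteger attempts) {
        if (attempts.incrementAndGet() < 3) {
            throw new RuntimeException("Access token is empty");
        }
        return "fresh-certificate";
    }

    // The onFailure path: a thrown RuntimeException triggers another
    // refresh attempt instead of caching the bad result.
    static String refreshWithRetry(AtomicInteger attempts, int maxTries) {
        RuntimeException last = null;
        for (int i = 0; i < maxTries; i++) {
            try {
                return refresh(attempts);
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        String result = refreshWithRetry(attempts, 5);
        if (!result.equals("fresh-certificate") || attempts.get() != 3) {
            throw new AssertionError("retry flow did not recover as expected");
        }
        System.out.println("recovered after " + attempts.get() + " attempts");
    }
}
```

Without the exception, the same bad token would be cached as if it were valid, which matches the "fixed and concealed" framing in the next comment.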

@enocom (Member Author) commented Apr 25, 2023

That's a good point. I've omitted any logging, and it seems the Java Connector doesn't always let exceptions bubble up, so I might have both fixed and concealed the problem.

@vr commented Jun 2, 2023

Still no issues here; can we consider it safe now?

@enocom (Member Author) commented Jun 2, 2023

Yes, I believe so. I'm leaving this open until I add logging just to make it clear which case we're hitting (bad token, expired token). Otherwise, this is effectively fixed.

7 participants