Cloud SQL IAM service account authentication failed for user #1174
Running a Spring Boot app with Hikari on a GCE instance. Hikari is configured with 5 connections set to refresh every 60s. I've also patched the Java connector to run a refresh cycle every 60s. We'll see if we can force out these IAM authn errors.
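For anyone wanting to reproduce a similar setup, a pool like the one described (5 connections recycled every 60s) can be sketched with Spring Boot's Hikari property namespace; the exact values here are assumptions, not the reporter's actual config:

```properties
# Hypothetical approximation of the described pool: 5 connections,
# each recycled after 60s (Spring Boot's spring.datasource.hikari.* keys).
spring.datasource.hikari.maximum-pool-size=5
spring.datasource.hikari.minimum-idle=5
# max-lifetime is in milliseconds; a short 60s lifetime forces frequent
# reconnects, which exercises the connector's IAM token-refresh path.
spring.datasource.hikari.max-lifetime=60000
```

The short lifetime is purely for stress-testing the refresh path; Hikari's default (30 minutes) is more appropriate in production.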
After running the above experiment for ~24 hours, I don't see any errors. Going to wrap this up in a container and run it on GKE to see if that helps.
We specifically observed the issue only with Cloud Run instances' service accounts, and never with GCE. We're not using GKE in this context, so I can't speak to that.
Observed this issue twice since January across three persistent Cloud Run nodes with CPU throttling disabled and a Hikari pool size of 10. Feb 10th - v1.8.3 - project A - single node failure. Nodes recovered on their own after 12 and 17 minutes respectively. The failures were on previously functioning nodes, not newly started ones. No relevant problem indicators in metrics or logs around that time. We have another service with a similar configuration, running v1.8.3 with 4 persistent nodes across 2 projects, that has not yet encountered the issue. Confirmed problem image based on service:
postgres:
Only theory so far is around the system clock; however, since it repros infrequently, we have not yet done extra investigation.
Thanks, folks. I'll try to reproduce this on Cloud Run as well. |
It is still happening from time to time on our clusters, slightly less with v1.10.0 than with v1.8.x, but still, we have rolled back to classical (non-IAM) authentication on our production system.
Current hypothesis is that there's (another) race condition in the underlying auth library -- I'm working today on another reproduction attempt. |
After running my patched connector in GKE for 24 hours, I still don't see any errors. For reference, I'm running with a
Pushing my debug app to Cloud Run with 1 CPU allocated and 1 minimum instance. |
This just reproed again - Cloud Run persistent - single node, starting 10:45 ET.
Base image is
Yes - however, in this version we have downgraded to an earlier release. Starting 5 minutes prior to the exceptions, there is a significant increase in logging activity along the lines of:
About 50 of them in total. A handful of requests complete normally, interspersed with this logging, then all requests begin to fail as the pool is exhausted, I suppose. FWIW, the timestamp printed in the above trace roughly matches the timestamp in the Cloud Run logs - same second. 10:39 ET - socket factory cycling begins + first "Cloud SQL IAM service account authentication failed" PG logs.
I assume that means you're seeing this:
Yes?
Yes, correct.
After running the app on Cloud Run for 24 hours, I still don't see this error. For next steps, I'm going to switch base images to
FWIW, given what I have seen, CPU load is not necessary to reproduce it, although maybe it will agitate things. 24 hours is also a bit of a short timeframe. Maybe I can give you a container with its logic stripped out that you can run on a larger pool of nodes?
Help reproducing would be much appreciated. Some customers have reported seeing this error on a much more frequent interval. For now, it's not clear what correlations there might be. |
I can share that we were seeing this issue ~1000 times a day across ~14 Cloud Run instances before disabling CPU throttling in Cloud Run. After disabling CPU throttling, we are still seeing it ~200 times a day. So if you have a hard time reproducing, enable CPU throttling (on the assumption that the product should also reliably manage IAM connections for CPU-throttled Cloud Run instances).
Very interesting that there is an increase correlating with CPU-throttling use. JVMs tend to struggle with GC under this configuration - leading back to CPU. The number of instances in your case will make it hard to detect, but I am wondering if there is anything interesting in your P99/max for the affected service? Is your CPU/memory usage high in general? @enocom ok, I'll see what I can do - maybe you can combine my image with @sscheible's app config.
Thanks for the additional information @sscheible and @michae1T. We've known that CPU throttling can cause problems in Cloud Run, which is the motivation for #992. I wonder if GKE deployments are likewise resource constrained.
In my case, this is clearly not resource constrained; it was happening more on applications which get little traffic.
In that case, I'm going to adjust my approach: restore the original connector and just run that for a few days. After running the debug app in Cloud Run and GKE for a week or so, I haven't seen a single occurrence of this error. Meanwhile, if folks could let me know what base container image they're using, that might help.
@enocom I attempted to create a repo with a distillation of the problem app image down to its base components, where all the business logic is replaced with a stub. It should build an image with the same base, jars, JVM params, server config/init, and Cloud SQL connection. You should be able to inject any additional logging as appropriate. Unfortunately, as highlighted before, the incident is fairly infrequent - you would need a pool of at least 120 nodes under the same circumstances to reproduce within 24 hours, if you have the budget. In our experience, you should not actually need to query the service to reproduce. To me there is no indication that the size of the connection pool is important, as all the connections eventually die during an incident.
Wow! Thanks @michae1T! I'll try this out and hopefully see the issue. |
np - fixed the repo link; hopefully it's helpful.
We're using amazoncorretto:17-alpine |
we are using |
So, quick update -- after running on Cloud Run with 100 instances for almost a week, I'm not seeing this error. Instead of continuing to guess, I'm going to make a change to the connector here to throw on empty tokens or expired tokens (the two most likely problems). From there we can further isolate this to a problem in the credentials code or possibly in the backend (although I've only heard about this in Java).
Yup, we never had the issue with our cloud-sql-proxy setup, but we have fewer of those deployments than Java ones.
Seeing very promising results with cloud-gcp-dependencies 3.4.7 (up from 3.4.3), which includes cloud-sql-jdbc-socket-factory#1147, in addition to having CPU throttling disabled. After monitoring over the weekend, it seems that this almost eliminates the issue (down from ~200/day to ~2/day, so another 2 orders of magnitude). N.B. we are still seeing at least some of these errors with 3.4.7 and throttling enabled; running some more tests to quantify.
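For anyone wanting to try the same bump, the BOM upgrade looks roughly like this in a Maven pom.xml; the coordinates are assumed from the Spring Cloud GCP project and should be verified against your own build:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Assumed Spring Cloud GCP BOM coordinates; 3.4.7 is the version
         the comment above reports as including the socket-factory fix. -->
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>spring-cloud-gcp-dependencies</artifactId>
      <version>3.4.7</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```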
Hello, I'm also having this issue. Here is my architecture:
It happens randomly. I found corresponding entries in the Cloud SQL logs:
Thanks @hanfi. If folks are seeing this bug, it would help to know:
|
I just merged #1233 which will throw an exception if an auth token is empty or expired (the two leading hypotheses). That will go out in our next release on Tuesday and I hope it will narrow down the issue here. |
v1.11.1 now has a check for an invalid token. I recommend upgrading to see if these occasional errors are caused by an invalid token. |
Thx @enocom, I don't find the error in the logs anymore :D Thanks a lot, it seems fixed for me.
No incident since March 15th on our side (#1174 (comment)), with no code changes. Based on the previous pattern, I would have expected it to reproduce in under a month. I wonder if our issue is not directly tied to the library, even if the library may be an aggravating factor. We will release with
Thanks for the update folks. I'll leave this open for now. |
The ability to reproduce this may depend on the version of google-auth-library in use; other Maven dependencies could pull in an older version of this library, resulting in a variation of the following issue when credentials.refresh() is called in SqlAdminApiFetcher.java: googleapis/google-auth-library-java#692 (comment). @enocom you may have fixed this with the RuntimeExceptions thrown in the validateAccessToken method added to SqlAdminApiFetcher.java in version 1.11.1. A RuntimeException should cause the onFailure callback in the performRefresh method of CloudSqlInstance.java to run, which then results in a retry of the refresh action.
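The check-then-retry behavior described above can be illustrated with a minimal sketch. The names and signatures below are illustrative approximations, not the connector's actual API:

```java
import java.time.Instant;

// Hedged sketch of a token-validation guard like the one described
// for v1.11.1: reject empty or already-expired tokens by throwing,
// so the caller's failure handler can schedule a refresh retry.
public class TokenCheck {

    // Throws RuntimeException when the token is unusable; a caller's
    // onFailure-style callback would catch this and retry the refresh.
    public static void validateAccessToken(String token, Instant expiry) {
        if (token == null || token.isEmpty()) {
            throw new RuntimeException("Access token is empty");
        }
        if (expiry != null && Instant.now().isAfter(expiry)) {
            throw new RuntimeException("Access token is expired");
        }
    }

    public static void main(String[] args) {
        // A token expiring an hour from now passes the guard.
        validateAccessToken("some-token", Instant.now().plusSeconds(3600));
        System.out.println("valid token accepted");
    }
}
```

The point of throwing (rather than silently returning a bad token to the socket factory) is that the failure surfaces in the refresh machinery, where it can be retried, instead of reaching the server as a doomed IAM login.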
That's a good point. I've omitted any logging, and it seems the Java Connector doesn't always let exceptions bubble up, so I might have both fixed and concealed the problem.
Still no issues here. Can we consider it safe now?
Yes, I believe so. I'm leaving this open until I add logging just to make it clear which case we're hitting (bad token, expired token). Otherwise, this is effectively fixed. |
Bug Description
An otherwise valid configuration will on occasion result in Cloud SQL IAM service account authentication failed for user. The backend will log Failed to validate access token (as opposed to access token expired). While customers have seen this error in Cloud Run (possibly suggesting a CPU-throttling issue with the background refresh), it also appears on GKE.
Updating to the latest version has not resolved these occasional errors.
Example code (or command)
No response
Stacktrace
No response
Steps to reproduce?
Environment
Additional Details
No response