Add a "lazy strategy" option that prevents refreshes from happening in the background #992

johanblumenberg · 2022-09-21T17:36:29Z

Bug Description

Every now and then we see a failure to refresh the ephemeral certificate used to connect to Cloud SQL.
Most of the time this is just an annoying error in the log, because the service tries again and succeeds. It still adds a lot of noise to our monitoring.
But every now and then the retry also fails, and we lose traffic because we have no certificate to authenticate to the DB.

The error that we get is this:

Got more than one input failure. Logging failures after the first
java.lang.RuntimeException: [...] Failed to update metadata for Cloud SQL instance.
	at com.google.cloud.sql.core.CloudSqlInstance.addExceptionContext(CloudSqlInstance.java:598)
	at com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata(CloudSqlInstance.java:505)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Error writing to server
	at java.base/sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:718)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:730)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1613)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
	at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527)
	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:334)
	at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:152)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
	at com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata(CloudSqlInstance.java:460)
	... 9 more

Environment

OS type and version: Cloud Run Service
Java SDK version:

Base docker image is eclipse-temurin:11-jre-focal

openjdk 11.0.16.1 2022-08-12
OpenJDK Runtime Environment Temurin-11.0.16.1+1 (build 11.0.16.1+1)
OpenJDK 64-Bit Server VM Temurin-11.0.16.1+1 (build 11.0.16.1+1, mixed mode, sharing)

Socket Factory version: com.google.cloud.sql:jdbc-socket-factory-core:jar:1.6.3


johanblumenberg · 2022-09-21T17:38:37Z

We have been in contact with Google support, but after 3 months the only sensible response we have gotten so far is that the failure might be related to the fact that Cloud Run uses CPU throttling and does not support background tasks. Since the refresh runs periodically as a background task, this might be the issue.

This is the support ticket: https://console.cloud.google.com/support/cases/detail/v2/29992086?project=veritru-dev-332314

johanblumenberg · 2022-09-21T17:46:57Z

We have created a patch to remove the background process, and this seems to solve the problem. We have used the same patch both in our Cloud Run services and in our Cloud Functions, and so far it seems to work. We have not seen any failures in our logs since we applied the patch.

We have been running with this patch for 8 days now, with zero errors. Before applying the patch we could see failures every other day or so.
We have also seen that we can avoid the error in our Cloud Run services by enabling CPU always on, which also indicates that the problem is indeed related to the background process. Unfortunately this workaround is not available for Cloud Functions.

This patch solves the problem for us: truid-app/google-cloud-patch@bf94b6b

This solution is probably not ideal, because it completely disables the background process and refreshes the certificate on the thread that needs it while processing a request. In cases where you don't use CPU throttling you probably would like the certificate to be updated in the background, instead of adding a delay to the request processing.

kurtisvg · 2022-09-21T18:30:36Z

So to be clear, it doesn't look like you've removed the background process, but have just made it force a new refresh before an error has occurred rather than after. It's still probably happening in the background, but now it'll fail silently.

Maybe an ideal solution would be to offer some option to specify a retry strategy, and add a "lazy" option that retries as needed rather than automatically.

johanblumenberg · 2022-09-21T18:56:41Z

As I wrote, this is just a proposal. There are probably better ways to solve the problem.

> So to be clear, it doesn't look like you've removed the background process, but have just made it force a new refresh before an error has occurred rather than after.

There is no background process. When there is no traffic towards our services, no refresh is done. I would see in the logs if a refresh happened, and there is none.

The constructor still schedules a single refresh operation, because the currentInstanceData and nextInstanceData member variables should not be null. Otherwise you would have to handle the special case where these variables are null on the first access. Once the first refresh finishes, no new job is scheduled automatically.

I removed the code that schedules another refresh automatically. Instead it is forced when you access the SSL data. So the pending refresh job only exists for a short time while it is being executed. It is never scheduled to run in the future.

> It's still probably happening in the background, but now it'll fail silently.

It's not failing silently. We would still see the failed refresh in the logs, even if it doesn't cause any incoming traffic to fail. We have not seen any refresh failures after this patch was applied.

> Maybe an ideal solution would be to offer some option to specify a retry strategy, and add a "lazy" option that retries as needed rather than automatically.

Yes, I think that would be a good idea
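
For readers following along, the access-triggered refresh described in this comment can be sketched roughly as below. This is a minimal illustration, not the code from the linked patch or from the connector: `CertInfo`, `LazyCertCache`, the `fetcher` supplier (standing in for the Admin API call), and the `refreshBuffer` parameter are all hypothetical names. Unlike the patch, which keeps one initial refresh in the constructor, this sketch simply treats "nothing cached yet" as a reason to fetch on first access.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical holder for a certificate and its expiry; not the connector's real type. */
final class CertInfo {
  final String pem;
  final Instant expiresAt;

  CertInfo(String pem, Instant expiresAt) {
    this.pem = pem;
    this.expiresAt = expiresAt;
  }
}

/** Refreshes on the calling thread, and only when a caller asks for the certificate. */
final class LazyCertCache {
  private final Supplier<CertInfo> fetcher; // stands in for the Admin API call
  private final Duration refreshBuffer;     // refresh this long before expiry
  private CertInfo current;                 // null until the first access

  LazyCertCache(Supplier<CertInfo> fetcher, Duration refreshBuffer) {
    this.fetcher = fetcher;
    this.refreshBuffer = refreshBuffer;
  }

  /** Returns a usable certificate, fetching a new one first if needed. */
  synchronized CertInfo get(Instant now) {
    boolean missing = current == null;
    boolean nearExpiry = !missing && !now.isBefore(current.expiresAt.minus(refreshBuffer));
    if (missing || nearExpiry) {
      // No scheduled task: the fetch runs here, while the request holds the CPU.
      current = fetcher.get();
    }
    return current;
  }
}
```

The trade-off is the one already mentioned above: the request that triggers the refresh pays the latency of the fetch, in exchange for never depending on CPU time outside a request.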

kurtisvg · 2022-10-18T23:03:22Z

Posting some rationale here for why we consider this a P2 for now:

1. We believe it's a fairly rare occurrence: it occurs when the process is throttled after the refresh has already started but before it is allowed to complete. If the process is throttled before the refresh has started, it's unlikely to have enough CPU to begin until the process is unthrottled.
2. It's a fairly invasive change: currently we use a 2 thread executor that's shared between all of the Cloud SQL instances. We need to decide how to handle that executor if we don't want to execute requests in the background. We also need to make the behavior configurable and have it coexist in a logical way with the current behavior, which is preferred for most users. There's a minimum of a few weeks of work to come up with a design and verify it doesn't introduce any new issues.
3. A workaround is fairly simple: because a failed scheduled refresh immediately triggers a new refresh that blocks future connections, requesting a new connection should lead to a successful refresh. Trying a second time to grab a new connection should allow a refresh operation to complete successfully, e.g.:
a. Refresh operation starts
b. Process is throttled
c. Process is unthrottled
d. Refresh operation throws exception because the interaction with the Admin API has expired
e. A new refresh operation is scheduled, blocking future connection requests
f. App/Pool grabs a new connection, which blocks until the refresh is complete
g. Refresh operation completes -> connect attempt is successful
h. Request is complete, process is throttled again until the next request
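
On the application side, the "try a second time" workaround in steps e to g could look something like the sketch below. It assumes the failed refresh surfaces to the caller as a `java.sql.SQLException`, which depends on the driver and pool in use; `RetryOnce` and `ConnectionSource` are hypothetical names, not part of the connector.

```java
import java.sql.SQLException;

/** Hypothetical helper: retries a connection attempt exactly once. */
final class RetryOnce {
  /** Stands in for DataSource.getConnection() or a pool's checkout method. */
  interface ConnectionSource<T> {
    T get() throws SQLException;
  }

  static <T> T getWithOneRetry(ConnectionSource<T> source) throws SQLException {
    try {
      return source.get();
    } catch (SQLException first) {
      // Assumption: the failure above already triggered a new refresh,
      // so this second attempt blocks until that refresh completes.
      try {
        return source.get();
      } catch (SQLException second) {
        second.addSuppressed(first); // keep the original failure for debugging
        throw second;
      }
    }
  }
}
```

With a standard `javax.sql.DataSource`, usage would be along the lines of `Connection c = RetryOnce.getWithOneRetry(dataSource::getConnection);`.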

The lazy refresh strategy only refreshes credentials and certificate information when the application attempts to establish a new database connection. On Cloud Run and other serverless runtimes, this is more reliable than the default background refresh strategy. Fixes #992.
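
As a rough idea of how an application might opt in, here is a sketch that builds JDBC connection properties. To the best of my knowledge the released connector reads the strategy from a `cloudSqlRefreshStrategy` property with the value `lazy`, but treat that name, and the Postgres socket factory class used here, as assumptions to verify against the documentation for your connector version. `LazyRefreshConfig` is a hypothetical helper name.

```java
import java.util.Properties;

/** Hypothetical helper that assembles JDBC properties for a lazy-refresh connection. */
final class LazyRefreshConfig {
  static Properties connectionProperties(String instanceConnectionName, String user) {
    Properties props = new Properties();
    // Assumption: a Postgres database; other engines use a different factory class.
    props.setProperty("socketFactory", "com.google.cloud.sql.postgres.SocketFactory");
    props.setProperty("cloudSqlInstance", instanceConnectionName);
    props.setProperty("user", user);
    // Assumption: property name and value as documented for the released feature.
    props.setProperty("cloudSqlRefreshStrategy", "lazy");
    return props;
  }
}
```

The resulting `Properties` would then be passed to `DriverManager.getConnection(url, props)` or to the pool's data source configuration.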


This makes a number of refactoring changes to align the Java connector with other implementations.
- Introduce a ConnectionInfoCache interface
- Rename DefaultConnectionInfoCache to RefreshAheadConnectionInfoCache
- Update and simplify instantiation logic of RefreshAheadConnectionInfoCache

Part of #992


johanblumenberg added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Sep 21, 2022

blunderbuss-gcf bot assigned shubha-rajan Sep 21, 2022

kurtisvg added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Sep 21, 2022

kurtisvg added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Oct 11, 2022

kurtisvg changed the title ~~Failed to update metadata for Cloud SQL instance~~ Add a "lazy strategy" option that prevents refreshes from happening in the background Oct 11, 2022

enocom mentioned this issue Mar 17, 2023

Cloud SQL IAM service account authentication failed for user #1174

Closed

enocom unassigned shubha-rajan May 22, 2023

enocom assigned hessjcg Jul 7, 2023

enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Jul 7, 2023

enocom mentioned this issue Jul 26, 2023

Add support for lazy certificate refresh GoogleCloudPlatform/cloud-sql-python-connector#805

Closed

product-auto-label bot added the api: cloudsql label Aug 29, 2023

github-actions bot removed the api: cloudsql label Aug 29, 2023

product-auto-label bot added the api: cloudsql label Aug 30, 2023

github-actions bot removed the api: cloudsql label Aug 31, 2023

product-auto-label bot added the api: cloudsql label Sep 1, 2023

github-actions bot removed the api: cloudsql label Sep 1, 2023

product-auto-label bot added the api: cloudsql label Sep 2, 2023

github-actions bot removed the api: cloudsql label Sep 6, 2023

product-auto-label bot added the api: cloudsql label Sep 7, 2023

github-actions bot removed the api: cloudsql label Sep 8, 2023

product-auto-label bot added the api: cloudsql label Sep 9, 2023

github-actions bot removed the api: cloudsql label Sep 11, 2023

product-auto-label bot added the api: cloudsql label Sep 12, 2023

hessjcg added a commit that referenced this issue May 21, 2024

test: Adds integration test to ensure that Lazy Refresh works. (part of #992)

d421641

hessjcg added a commit that referenced this issue May 23, 2024

chore: Introduce ConnectionInfoCache interface (part of #992).

b7493bc

hessjcg added a commit that referenced this issue May 28, 2024

chore: Introduce ConnectionInfoCache interface (part of #992).

93a57e8

hessjcg added a commit that referenced this issue May 28, 2024

ALT-1 test: Adds integration test to ensure that Lazy Refresh works. …

9579555

…Part of #992.

hessjcg added a commit that referenced this issue May 28, 2024

ALT-1 test: Adds integration test to ensure that Lazy Refresh works. …

6d8ed02

…Part of #992.

hessjcg added a commit that referenced this issue May 28, 2024

ALT-1 test: Adds integration test to ensure that Lazy Refresh works. …

8c364bf

…Part of #992.

hessjcg added a commit that referenced this issue May 29, 2024

ALT-1 test: Adds integration test to ensure that Lazy Refresh works. …

d80b972

…Part of #992.

hessjcg added a commit that referenced this issue May 29, 2024

ALT-1 test: Adds integration test to ensure that Lazy Refresh works. …

9cd710a

…Part of #992.

hessjcg added a commit that referenced this issue May 29, 2024

ALT-1 feat: Add lazy refresh strategy to the connector. Fixes #992.

10b3cf1

hessjcg added a commit that referenced this issue May 29, 2024

ALT-1 feat: Add lazy refresh strategy to the connector. Fixes #992.

e5d7f08

hessjcg added a commit that referenced this issue May 29, 2024

ALT-1 test: Adds integration test to ensure that Lazy Refresh works. …

5f73710

…Part of #992.

hessjcg added a commit that referenced this issue May 29, 2024

ALT-1 feat: Add lazy refresh strategy to the connector. Fixes #992.

fa7848e

hessjcg closed this as completed in #1990 May 29, 2024

hessjcg added a commit that referenced this issue May 29, 2024

test: Adds integration test to ensure that Lazy Refresh works. Part of #992.

9cf25fb

hessjcg added a commit that referenced this issue May 29, 2024

test: Adds integration test of the Lazy Refresh strategy. Part of #992.

d57baeb

hessjcg added a commit that referenced this issue Jun 3, 2024

Merge branch 'main' into gh-992-lazy-refresh

a64d8a9

hessjcg added a commit that referenced this issue Jun 3, 2024

test: Adds integration test of the Lazy Refresh strategy. Part of #992.

57d7b83

hessjcg added a commit that referenced this issue Jun 3, 2024

test: Adds integration test of the Lazy Refresh strategy. Part of #992.…

a20b754

… (#1965) Creating a new lazy refresh strategy will make connectors more reliable on Cloud Run and other serverless platforms. Fixes #992

release-please bot mentioned this issue Jun 11, 2024

chore(main): release 1.19.0 #2017

Merged
