Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a "lazy strategy" option that prevents refreshes from happening in the background #992

Closed
johanblumenberg opened this issue Sep 21, 2022 · 6 comments · Fixed by #1965, #1990 or #2017
Assignees
Labels
priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@johanblumenberg
Copy link

Bug Description

Every now and then we see a failure to refresh the ephemeral certificate used to connect to Cloud SQL.
Most of the time this is just an annoying error in the log, but the service tries again and succeeds most of the time. This is still annoying, since it adds a lot of noise to our monitoring.
But every now and then also the retry is failing, and we lose traffic because we don't have any certificate to authenticate to the DB.

The error that we get is this:

Got more than one input failure. Logging failures after the first
java.lang.RuntimeException: [...] Failed to update metadata for Cloud SQL instance.
	at com.google.cloud.sql.core.CloudSqlInstance.addExceptionContext(CloudSqlInstance.java:598)
	at com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata(CloudSqlInstance.java:505)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Error writing to server
	at java.base/sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:718)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:730)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1613)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
	at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527)
	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:334)
	at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:152)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
	at com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata(CloudSqlInstance.java:460)
	... 9 more

Environment

  1. OS type and version: Cloud Run Service
  2. Java SDK version:

Base docker image is eclipse-temurin:11-jre-focal

openjdk 11.0.16.1 2022-08-12
OpenJDK Runtime Environment Temurin-11.0.16.1+1 (build 11.0.16.1+1)
OpenJDK 64-Bit Server VM Temurin-11.0.16.1+1 (build 11.0.16.1+1, mixed mode, sharing)

  1. Socket Factory version: com.google.cloud.sql:jdbc-socket-factory-core:jar:1.6.3
@johanblumenberg johanblumenberg added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Sep 21, 2022
@johanblumenberg
Copy link
Author

johanblumenberg commented Sep 21, 2022

We have been in contact with the google support, but after 3 months the only sensible response that we have gotten so far is that it might be related to the fact that Cloud Run uses CPU throttling and does not support background tasks. Since the refresh is done periodically as a background task, this might be an issue.

This is the support ticket: https://console.cloud.google.com/support/cases/detail/v2/29992086?project=veritru-dev-332314

@johanblumenberg
Copy link
Author

johanblumenberg commented Sep 21, 2022

We have created a patch to remove the background process, and this seems to solve the problem. We have used the same patch both in our Cloud Run services and in our Cloud Functions, and so far it seems to work. We have not seen any failures in our logs since we applied the patch.

We have been running with this patch for 8 days now, with zero errors. Before applying the patch we could see failures every other day or so.
We have also seen that we can avoid the error in our Cloud Run services by enabling CPU always on, which also indicates that the problem is indeed related to the background process. Unfortunately this workaround is not available for Cloud Functions.

This patch solves the problem for us: truid-app/google-cloud-patch@bf94b6b

This solution is probably not ideal, because it completely disables the background process and refreshes the certificate on the thread that needs it while processing a request. In cases where you don't use CPU throttling you probably would like the certificate to be updated in the background, instead of adding a delay to the request processing.

@kurtisvg
Copy link
Contributor

So to be clear, it doesn't look like you've removed the background process, but have just made it force a new refresh before an error has occurred rather than after. It's still probably happening in the background, but now it'll fail silently.

Maybe an ideal solution would be to offer some option to specify a retry strategy, and add a "lazy" option that retries as needed rather than automatically.

@kurtisvg kurtisvg added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Sep 21, 2022
@johanblumenberg
Copy link
Author

As I wrote, this is just a proposal. There are probably better ways to solve the problem.

So to be clear, it doesn't look like you've removed the background process, but have just made it force a new refresh before an error has occurred rather than after.

There is no background process. When there is no traffic towards our services, no refresh is done. I would see in the logs if a refresh happened, and there is none.

The constructor still schedules a single refresh operation, because the currentInstanceData and nextInstanceData member variables should not be null. Otherwise you would have to handle the special case where these variables are null on the first access. Once the first refresh finishes, no new job is scheduled automatically.

I removed the code that schedules another refresh automatically. Instead it is forced when you access the SSL data. So the pending refresh job only exists for a short time while it is being executed. It is never scheduled to run in the future.

It's still probably happening in the background, but now it'll fail silently.

It's not failing silently. We would still see the failed refresh in the logs, even if it doesn't cause any incoming traffic to fail. We have not seen any refresh failures after this patch was applied.

Maybe an ideal solution would be to offer some option to specify a retry strategy, and add a "lazy" option that retries as needed rather than automatically.

Yes, I think that would be a good idea

@kurtisvg kurtisvg added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Oct 11, 2022
@kurtisvg kurtisvg changed the title Failed to update metadata for Cloud SQL instance Add a "lazy strategy" option that prevents refreshes from happening in the background Oct 11, 2022
@kurtisvg
Copy link
Contributor

Posting some rational here for why we consider this a P2 for now:

  1. We believe it's a fairly rare occurrence: it occurs when the process is throttled after the refresh has already started but before it is allowed to complete. If the process is throttled before the refresh has started, it's unlikely to have enough CPU to be begin until the process is unthrottled.
  2. It's a fairly invasive change: currently we use a 2 thread executor that's shared between all of the Cloud SQL instances. We need to decide how to handle that executor if we don't want to execute requests in the background. We also need to make the behavior configurable and persist in a logical way with the current behavior, which is preferred for most users. There's a minimum of a few weeks of work to come up with a design and verify it doesn't introduce any new issues.
  3. A workaround is fairly simple: because a new refresh is triggered immediately after a scheduled refresh fails which blocks future connections, request a new connection should lead to a successful refresh. Trying a second time to grab a new connection should allow a refresh operation to complete successfully, e.g.:
    a. Refresh operation starts
    b. Process is throttled
    c. Process is unthrottled
    d. Refresh operation throws exception because the interaction with the Admin API has expired
    e. A new refresh operation is scheduled, blocking future connection requests
    f. App/Pool grabs a new connection, which blocks until the refresh is complete
    g. Refresh operation completes -> connect attempt is successful
    h. Request is complete, process is throttled again until the next request

hessjcg added a commit that referenced this issue May 21, 2024
The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.
hessjcg added a commit that referenced this issue May 23, 2024
chore: Refactor RefreshAheadConnectionInfoCache. Part of #992.

The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.
hessjcg added a commit that referenced this issue May 23, 2024
The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.
hessjcg added a commit that referenced this issue May 23, 2024
The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.

WIP Refactor BaseConnectionInfoCache
hessjcg added a commit that referenced this issue May 23, 2024
The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.
hessjcg added a commit that referenced this issue May 23, 2024
The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.

WIP Refactor BaseConnectionInfoCache

chore: Refactor RefreshAheadConnectionInfoCache. Part of #992.

The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.

WIP Refactor BaseConnectionInfoCache
hessjcg added a commit that referenced this issue May 23, 2024
The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.
hessjcg added a commit that referenced this issue May 24, 2024
The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.

WIP Refactor BaseConnectionInfoCache

chore: Refactor RefreshAheadConnectionInfoCache. Part of #992.

The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.

WIP Refactor BaseConnectionInfoCache
hessjcg added a commit that referenced this issue May 28, 2024
The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.

WIP Refactor BaseConnectionInfoCache

chore: Refactor RefreshAheadConnectionInfoCache. Part of #992.

The lazy refresh strategy only refreshes credentials and certificate information when
the application attempts to establish a new database connection. On Cloud Run
and other serverless runtimes, this is more reliable than the default background
refresh strategy.   

Fixes #992.

WIP Refactor BaseConnectionInfoCache
hessjcg added a commit that referenced this issue May 29, 2024
This makes a number of refactoring changes to align the Java connector with other implementations.

- Introduce a ConnectionInfoCache interface
- Rename DefaultConnectionInfoCache to RefreshAheadConnectionInfoCache
- Update and simplify instantiation logic of RefreshAheadConnectionInfoCache

Part of #992
hessjcg added a commit that referenced this issue May 29, 2024
The lazy refresh strategy only refreshes credentials and certificate information when the application attempts
to establish a new database connection. On Cloud Run and other serverless runtimes, this is more reliable
than the default background refresh strategy.

Fixes #992
hessjcg added a commit that referenced this issue Jun 3, 2024
… (#1965)

Creating a new lazy refresh strategy will make connectors more reliable on Cloud Run and other serverless platforms.

Fixes #992
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
6 participants