`bad_certificate` errors intermittently preventing connection to cloud sql #1314

msammarco · 2023-06-02T17:28:32Z

Bug Description

When using hikari database connection pooling with cloud sql we're seeing intermittently, usually after a few hours an issue where connections become unavailable with a bad_certificate error. We see this with mysql and postgres cloud sql connections authenticating with username/password, and differing hikari configurations (most set with the default hikari config of maxLifetime of 30 mins).

Example code (or command)

No response

Stacktrace

....
Caused by: java.sql.SQLTransientConnectionException: HikariPool-3 - Connection is not available, request timed out after 30000ms.
	at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:695)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:162)
	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
	at com.company.config.CloseableDataSourceHikari.getConnection(CloseableDataSourceHikari.java:25)
	at org.jdbi.v3.core.Jdbi.open(Jdbi.java:319)
	... 18 common frames omitted
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
	at org.postgresql.Driver.makeConnection(Driver.java:443)
	at org.postgresql.Driver.connect(Driver.java:297)
	at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)
	at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:358)
	at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:206)
	at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:477)
	at com.zaxxer.hikari.pool.HikariPool.access$100(HikariPool.java:71)
	at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:725)
	at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:711)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	... 3 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:358)
	at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
	at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:204)
	at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)
	at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1510)
	at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1425)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:455)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:426)
	at com.google.cloud.sql.core.CoreSocketFactory.createSslSocket(CoreSocketFactory.java:339)
	at com.google.cloud.sql.core.CoreSocketFactory.connect(CoreSocketFactory.java:201)
	at com.google.cloud.sql.postgres.SocketFactory.createSocket(SocketFactory.java:77)
	at org.postgresql.core.PGStream.createSocket(PGStream.java:231)
	at org.postgresql.core.PGStream.<init>(PGStream.java:98)
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
	... 15 common frames omitted

Steps to reproduce?

Connect to cloud sql either postgres or mysql using username/password ("Allow only SSL connections" does not need to be enabled in cloud sql)
Let your app run for awhile usually the issue presents itself in a few hours or so. Unable to reproduce myself locally with differing hikari settings.

Environment

OS type and version: Docker Linux arm64
Java SDK version: 17
Cloud SQL Java Socket Factory version: postgres-socket-factory 1.11.2 - jdbi-cloudsql 2.3.5

Additional Details

My initial reaction was it looks similar to the closed issue here #472

And is possibly related to this Draft PR.

My colleague has opened an incident with google support already for this issue, but I think it might be worthwhile to open an issue here too. Case 45213975

The text was updated successfully, but these errors were encountered:

hessjcg · 2023-06-05T15:36:22Z

I'm investigating this error. I agree that it may be related to #1307, however it may also be a separate issue.
SSLHandshakeException: Received fatal alert: bad_certificate would occur when the java SSL socket client attempts to connect to the Cloud SQL server. The client uses a client certificate, and the server rejects the client certificate.

There are a number of cases that could cause this to occur, especially if the app has to run for a few hours, requiring the ephemeral certificate to be refreshed a few times.

hessjcg · 2023-06-09T20:41:49Z

I have finished the work to shore up certificate refresh bugs, See #1307. Please try out the next version of the JDBC Connector and see if the problem goes away.

enocom · 2023-06-09T20:50:58Z

FYI: We'll be cutting a release on Tuesday.

Ragge-dev · 2023-07-03T13:03:14Z

Hello @hessjcg. I was just about to open up a new issue when I saw this one, we are seeing this exact issue (same stacktrace, with error starting after app has run for a few hours) on both version 1.11.2 and 1.12.0 with version 1.11.1 being stable for us. Only difference is that we are using IAM to authenticate towards cloud sql.

We are suspecting that this commit in some way changed the behaviour of the certificate refresh logic to cause this error described by @msammarco, at least we can't find another relevant commit between versions 1.11.1 and 1.11.2.

Is this something you are able to look into? I currently have not reproduced it locally, although I could try.

tomirio619 · 2023-07-04T11:39:57Z

As mentioned by @Ragge-dev, we also experience the exact same problem. Version 1.11.1 works fine

enocom · 2023-07-05T13:47:45Z

Let's re-open this issue in that case. We'll investigate.

enocom · 2023-07-06T15:39:44Z

@Ragge-dev and @tomirio619 I understand it takes a few hours for this to manifest. How frequently do you see it and does it affect application connectivity?

tomirio619 · 2023-07-07T06:14:38Z

After testing, I found that the version in which the problem was introduced is 1.11.2.
In the affected versions (> 1.11.1) , my (Spring Boot) app successfully connects to the database (using CloudSQL IAM auth) during application start.
Within an hour or two, the SSLHandshakeException/bad_certificate exception is thrown.
Except for a restart, the app is unable to (temporarily) recover from this "broken database connection" state.

Ragge-dev · 2023-07-10T06:58:53Z

@enocom Sorry for the delay, what @tomirio619 describes is exactly our experience as well. The error always occurs a few hours after the app has started.

enocom · 2023-07-10T15:29:24Z

Where are you running the Java Connector? Is this all in GKE?

enocom · 2023-07-10T15:32:01Z

FYI I've added some debug logging here: #1348.

This should at least make it clear what's going wrong with our refresh operations.

enocom · 2023-07-10T15:33:05Z

We'll cut a release tomorrow which might fix this, as we've made some further improvements to the factoring of our background refresh operations and the logging will be a part of it. Meanwhile, we'll work on reproducing the issue or confirming it's been fixed.

tomirio619 · 2023-07-11T05:56:29Z

Where are you running the Java Connector? Is this all in GKE?

Yes

tomirio619 · 2023-07-11T05:57:15Z

We'll cut a release tomorrow which might fix this, as we've made some further improvements to the factoring of our background refresh operations and the logging will be a part of it. Meanwhile, we'll work on reproducing the issue or confirming it's been fixed.

Great, I will test it when it comes out

hessjcg · 2023-07-11T17:45:00Z

I ran a simple Java application that runs connects and query every minute in GKE. I was unable to reproduce this exact error, but found an occasional problem during refresh where the root cause was: java.net.UnknownHostException: metadata.google.internal I saw UnknownHostException errors also for the host sqladmin.googleapis.com.

I will continue to run tests to see if this can be reproduced. A few more things I will try:

use a private IP
leave connections idle for a 90 minute period between attempts to connect and queries

Ragge-dev · 2023-07-13T09:38:04Z

Thank you for continuing to look into this! I tried an app yesterday with the latest version (1.13.0) and the problem (bad certificate) appeared after ~4 hours, so it's still there. I'll try again today to see if I can see your debug logging, but I'll be leaving for vacation after this week so will be my last attempt for a while.

Might be relevant information that I realize I have not shared, we only experience this issue on apps which are connected to multiple databases (postgres in this case). Any app only connected to one database works fine with the latest versions.

see GoogleCloudPlatform/cloud-sql-jdbc-socket-factory#1314

hessjcg · 2023-07-13T15:53:25Z

Thank you for sharing that important detail. We will try to reproduce with two databases today. Are these apps connecting to multiple databases on the same Cloud SQL Postgres instance, or are they connecting to multiple Cloud SQL Postgres instances?

enocom · 2023-07-13T16:43:35Z

I imagine we can reproduce this with more connection pools to the same instance.

For now, it's best to stick with v1.11.1 and we'll work to fix this and cut a patch release as soon as we're convinced we've got the problem solved.

enocom · 2023-07-13T22:26:36Z

I've been running a little Spring Boot app that connects to two databases in one instance in my GKE cluster. My app uses two separate connection pools and has Hikari setup to refresh all connections every minute (with the thought that connection creation will trigger this bug). After half a day, I don't see the bad certificate yet. I'm going to downgrade this to a p2 to signal that this bug isn't as pervasive as we original thought. But we're still working on identifying the root cause.

tomirio619 · 2023-07-17T05:59:43Z

Might be relevant information that I realize I have not shared, we only experience this issue on apps which are connected to multiple databases (postgres in this case). Any app only connected to one database works fine with the latest versions.

Interesting, in our case the affected app is also connecting to multiple different CloudSQL instances

…esh churn #1314

To refresh a Cloud SQL Instance's certificates, the current algorithm uses 2 threads from the thread pool for each instance. Because the thread pool is fixed size, if a user configures 2 or more instances, they were at risk of a causing thread starvation and a deadlock. This increases the thread count to a safe level so that most users will never experience a deadlock. Fixes #1314

…esh churn #1314

enocom · 2023-07-18T20:27:19Z

Hey folks, we found an edge case where the connector when configured with multiple instances would flood the thread pool in such a way that the threads would be unable to progress. We've identified a few related improvements in #1391 and #1390. We'll cut a patch release this week for people affected by this.

Update the logic in forceRefresh() to reduce the churn on the thread pool when the certificate refresh API calls are failing. New forceRefresh() logic ensures that: Only 1 refresh cycle may run at a time. If a refresh cycle is in progress, then it will not be canceled until it succeeds. Add new test cases to validate race conditions, deadlocks, and concurrency. Add additional logging to help diagnose production problems with certificate refresh. Related to #1314 Fixes #1209 Fixes #1159

enocom · 2023-07-21T22:18:14Z

This has been fixed in v1.13.1. Thank you for all your help debugging this folks.

see GoogleCloudPlatform/cloud-sql-jdbc-socket-factory#1314

msammarco added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jun 2, 2023

blunderbuss-gcf bot assigned hessjcg Jun 2, 2023

enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Jun 2, 2023

hessjcg closed this as completed Jun 9, 2023

enocom reopened this Jul 5, 2023

enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Jul 7, 2023

hessjcg added a commit to GoogleCloudPlatform/cloud-sql-proxy-operator that referenced this issue Jul 13, 2023

wip: attempt to reproduce cloud-sql-jdbc-socket-factory #1314

2b24804

see GoogleCloudPlatform/cloud-sql-jdbc-socket-factory#1314

enocom added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Jul 13, 2023

hessjcg mentioned this issue Jul 14, 2023

fix: Repair concurrency logic for refreshing certificates. #1389

Closed

hessjcg added a commit that referenced this issue Jul 17, 2023

fix: Increase threadpool count to avoid deadlock and reduce forceRefr…

42fb00b

…esh churn #1314

hessjcg mentioned this issue Jul 17, 2023

fix: remove race condition bug in refresh logic #1390

Merged

hessjcg added a commit that referenced this issue Jul 17, 2023

fix: Increase threadpool count to avoid deadlock and reduce forceRefr…

7e8d949

…esh churn #1314

hessjcg added a commit that referenced this issue Jul 17, 2023

fix: Increase threadpool count to avoid deadlock and reduce forceRefr…

3de9270

…esh churn #1314

hessjcg mentioned this issue Jul 18, 2023

fix: Increase threadpool count to avoid deadlocks #1391

Merged

hessjcg added a commit that referenced this issue Jul 18, 2023

fix: Increase threadpool count to avoid deadlock and reduce forceRefr…

257fca6

…esh churn #1314

hessjcg closed this as completed in #1391 Jul 18, 2023

infra-db-release-bot mentioned this issue Jul 18, 2023

chore(main): release 1.13.1 #1392

Merged

hessjcg added a commit that referenced this issue Jul 18, 2023

fix: Increase threadpool count to avoid deadlock and reduce forceRefr…

9ab3b93

…esh churn #1314

hessjcg added a commit that referenced this issue Jul 18, 2023

fix: Increase threadpool count to avoid deadlock and reduce forceRefr…

161983b

…esh churn #1314

hessjcg added a commit that referenced this issue Jul 18, 2023

fix: Increase threadpool count to avoid deadlock and reduce forceRefr…

c1c7763

…esh churn #1314

hessjcg added a commit that referenced this issue Jul 18, 2023

fix: Increase threadpool count to avoid deadlock and reduce forceRefr…

17953f8

…esh churn #1314

breun mentioned this issue Jul 24, 2023

Spring Cloud GCP's managed version for Cloud SQL JBDC Socket Factory is buggy GoogleCloudPlatform/spring-cloud-gcp#2049

Closed

hessjcg added a commit to GoogleCloudPlatform/cloud-sql-proxy-operator that referenced this issue Aug 1, 2023

wip: attempt to reproduce cloud-sql-jdbc-socket-factory #1314

3b74f2f

see GoogleCloudPlatform/cloud-sql-jdbc-socket-factory#1314

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`bad_certificate` errors intermittently preventing connection to cloud sql #1314

`bad_certificate` errors intermittently preventing connection to cloud sql #1314

msammarco commented Jun 2, 2023 •

edited

Loading

hessjcg commented Jun 5, 2023 •

edited

Loading

hessjcg commented Jun 9, 2023

enocom commented Jun 9, 2023

Ragge-dev commented Jul 3, 2023

tomirio619 commented Jul 4, 2023

enocom commented Jul 5, 2023

enocom commented Jul 6, 2023

tomirio619 commented Jul 7, 2023 •

edited

Loading

Ragge-dev commented Jul 10, 2023

enocom commented Jul 10, 2023

enocom commented Jul 10, 2023

enocom commented Jul 10, 2023

tomirio619 commented Jul 11, 2023

tomirio619 commented Jul 11, 2023

hessjcg commented Jul 11, 2023

Ragge-dev commented Jul 13, 2023 •

edited

Loading

hessjcg commented Jul 13, 2023 •

edited

Loading

enocom commented Jul 13, 2023

enocom commented Jul 13, 2023 •

edited

Loading

tomirio619 commented Jul 17, 2023 •

edited

Loading

enocom commented Jul 18, 2023

enocom commented Jul 21, 2023

bad_certificate errors intermittently preventing connection to cloud sql #1314

bad_certificate errors intermittently preventing connection to cloud sql #1314

Comments

msammarco commented Jun 2, 2023 • edited Loading

Bug Description

Example code (or command)

Stacktrace

Steps to reproduce?

Environment

Additional Details

hessjcg commented Jun 5, 2023 • edited Loading

hessjcg commented Jun 9, 2023

enocom commented Jun 9, 2023

Ragge-dev commented Jul 3, 2023

tomirio619 commented Jul 4, 2023

enocom commented Jul 5, 2023

enocom commented Jul 6, 2023

tomirio619 commented Jul 7, 2023 • edited Loading

Ragge-dev commented Jul 10, 2023

enocom commented Jul 10, 2023

enocom commented Jul 10, 2023

enocom commented Jul 10, 2023

tomirio619 commented Jul 11, 2023

tomirio619 commented Jul 11, 2023

hessjcg commented Jul 11, 2023

Ragge-dev commented Jul 13, 2023 • edited Loading

hessjcg commented Jul 13, 2023 • edited Loading

enocom commented Jul 13, 2023

enocom commented Jul 13, 2023 • edited Loading

tomirio619 commented Jul 17, 2023 • edited Loading

enocom commented Jul 18, 2023

enocom commented Jul 21, 2023

`bad_certificate` errors intermittently preventing connection to cloud sql #1314

`bad_certificate` errors intermittently preventing connection to cloud sql #1314

msammarco commented Jun 2, 2023 •

edited

Loading

hessjcg commented Jun 5, 2023 •

edited

Loading

tomirio619 commented Jul 7, 2023 •

edited

Loading

Ragge-dev commented Jul 13, 2023 •

edited

Loading

hessjcg commented Jul 13, 2023 •

edited

Loading

enocom commented Jul 13, 2023 •

edited

Loading

tomirio619 commented Jul 17, 2023 •

edited

Loading