Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bad_certificate errors intermittently preventing connection to cloud sql #1314

Closed
msammarco opened this issue Jun 2, 2023 · 22 comments · Fixed by #1391 or #1392
Closed

bad_certificate errors intermittently preventing connection to cloud sql #1314

msammarco opened this issue Jun 2, 2023 · 22 comments · Fixed by #1391 or #1392
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@msammarco
Copy link

msammarco commented Jun 2, 2023

Bug Description

When using hikari database connection pooling with cloud sql we're seeing intermittently, usually after a few hours an issue where connections become unavailable with a bad_certificate error. We see this with mysql and postgres cloud sql connections authenticating with username/password, and differing hikari configurations (most set with the default hikari config of maxLifetime of 30 mins).

Example code (or command)

No response

Stacktrace

....
Caused by: java.sql.SQLTransientConnectionException: HikariPool-3 - Connection is not available, request timed out after 30000ms.
	at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:695)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:162)
	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
	at com.company.config.CloseableDataSourceHikari.getConnection(CloseableDataSourceHikari.java:25)
	at org.jdbi.v3.core.Jdbi.open(Jdbi.java:319)
	... 18 common frames omitted
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
	at org.postgresql.Driver.makeConnection(Driver.java:443)
	at org.postgresql.Driver.connect(Driver.java:297)
	at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)
	at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:358)
	at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:206)
	at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:477)
	at com.zaxxer.hikari.pool.HikariPool.access$100(HikariPool.java:71)
	at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:725)
	at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:711)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	... 3 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:358)
	at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
	at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:204)
	at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)
	at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1510)
	at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1425)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:455)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:426)
	at com.google.cloud.sql.core.CoreSocketFactory.createSslSocket(CoreSocketFactory.java:339)
	at com.google.cloud.sql.core.CoreSocketFactory.connect(CoreSocketFactory.java:201)
	at com.google.cloud.sql.postgres.SocketFactory.createSocket(SocketFactory.java:77)
	at org.postgresql.core.PGStream.createSocket(PGStream.java:231)
	at org.postgresql.core.PGStream.<init>(PGStream.java:98)
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
	... 15 common frames omitted

Steps to reproduce?

  1. Connect to cloud sql either postgres or mysql using username/password ("Allow only SSL connections" does not need to be enabled in cloud sql)
  2. Let your app run for awhile usually the issue presents itself in a few hours or so. Unable to reproduce myself locally with differing hikari settings.

Environment

  1. OS type and version: Docker Linux arm64
  2. Java SDK version: 17
  3. Cloud SQL Java Socket Factory version: postgres-socket-factory 1.11.2 - jdbi-cloudsql 2.3.5

Additional Details

My initial reaction was it looks similar to the closed issue here #472

And is possibly related to this Draft PR.

My colleague has opened an incident with google support already for this issue, but I think it might be worthwhile to open an issue here too. Case 45213975

@msammarco msammarco added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jun 2, 2023
@enocom enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Jun 2, 2023
@hessjcg
Copy link
Collaborator

hessjcg commented Jun 5, 2023

I'm investigating this error. I agree that it may be related to #1307, however it may also be a separate issue.
SSLHandshakeException: Received fatal alert: bad_certificate would occur when the java SSL socket client attempts to connect to the Cloud SQL server. The client uses a client certificate, and the server rejects the client certificate.

There are a number of cases that could cause this to occur, especially if the app has to run for a few hours, requiring the ephemeral certificate to be refreshed a few times.

@hessjcg
Copy link
Collaborator

hessjcg commented Jun 9, 2023

I have finished the work to shore up certificate refresh bugs, See #1307. Please try out the next version of the JDBC Connector and see if the problem goes away.

@hessjcg hessjcg closed this as completed Jun 9, 2023
@enocom
Copy link
Member

enocom commented Jun 9, 2023

FYI: We'll be cutting a release on Tuesday.

@Ragge-dev
Copy link

Hello @hessjcg. I was just about to open up a new issue when I saw this one, we are seeing this exact issue (same stacktrace, with error starting after app has run for a few hours) on both version 1.11.2 and 1.12.0 with version 1.11.1 being stable for us. Only difference is that we are using IAM to authenticate towards cloud sql.

We are suspecting that this commit in some way changed the behaviour of the certificate refresh logic to cause this error described by @msammarco, at least we can't find another relevant commit between versions 1.11.1 and 1.11.2.

Is this something you are able to look into? I currently have not reproduced it locally, although I could try.

@tomirio619
Copy link

As mentioned by @Ragge-dev, we also experience the exact same problem. Version 1.11.1 works fine

@enocom
Copy link
Member

enocom commented Jul 5, 2023

Let's re-open this issue in that case. We'll investigate.

@enocom enocom reopened this Jul 5, 2023
@enocom
Copy link
Member

enocom commented Jul 6, 2023

@Ragge-dev and @tomirio619 I understand it takes a few hours for this to manifest. How frequently do you see it and does it affect application connectivity?

@tomirio619
Copy link

tomirio619 commented Jul 7, 2023

After testing, I found that the version in which the problem was introduced is 1.11.2.
In the affected versions (> 1.11.1) , my (Spring Boot) app successfully connects to the database (using CloudSQL IAM auth) during application start.
Within an hour or two, the SSLHandshakeException/bad_certificate exception is thrown.
Except for a restart, the app is unable to (temporarily) recover from this "broken database connection" state.

@enocom enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Jul 7, 2023
@Ragge-dev
Copy link

@enocom Sorry for the delay, what @tomirio619 describes is exactly our experience as well. The error always occurs a few hours after the app has started.

@enocom
Copy link
Member

enocom commented Jul 10, 2023

Where are you running the Java Connector? Is this all in GKE?

@enocom
Copy link
Member

enocom commented Jul 10, 2023

FYI I've added some debug logging here: #1348.

This should at least make it clear what's going wrong with our refresh operations.

@enocom
Copy link
Member

enocom commented Jul 10, 2023

We'll cut a release tomorrow which might fix this, as we've made some further improvements to the factoring of our background refresh operations and the logging will be a part of it. Meanwhile, we'll work on reproducing the issue or confirming it's been fixed.

@tomirio619
Copy link

Where are you running the Java Connector? Is this all in GKE?

Yes

@tomirio619
Copy link

We'll cut a release tomorrow which might fix this, as we've made some further improvements to the factoring of our background refresh operations and the logging will be a part of it. Meanwhile, we'll work on reproducing the issue or confirming it's been fixed.

Great, I will test it when it comes out

@hessjcg
Copy link
Collaborator

hessjcg commented Jul 11, 2023

I ran a simple Java application that runs connects and query every minute in GKE. I was unable to reproduce this exact error, but found an occasional problem during refresh where the root cause was: java.net.UnknownHostException: metadata.google.internal I saw UnknownHostException errors also for the host sqladmin.googleapis.com.

I will continue to run tests to see if this can be reproduced. A few more things I will try:

  • use a private IP
  • leave connections idle for a 90 minute period between attempts to connect and queries

@Ragge-dev
Copy link

Ragge-dev commented Jul 13, 2023

Thank you for continuing to look into this! I tried an app yesterday with the latest version (1.13.0) and the problem (bad certificate) appeared after ~4 hours, so it's still there. I'll try again today to see if I can see your debug logging, but I'll be leaving for vacation after this week so will be my last attempt for a while.

Might be relevant information that I realize I have not shared, we only experience this issue on apps which are connected to multiple databases (postgres in this case). Any app only connected to one database works fine with the latest versions.

hessjcg added a commit to GoogleCloudPlatform/cloud-sql-proxy-operator that referenced this issue Jul 13, 2023
@hessjcg
Copy link
Collaborator

hessjcg commented Jul 13, 2023

Thank you for sharing that important detail. We will try to reproduce with two databases today. Are these apps connecting to multiple databases on the same Cloud SQL Postgres instance, or are they connecting to multiple Cloud SQL Postgres instances?

@enocom
Copy link
Member

enocom commented Jul 13, 2023

I imagine we can reproduce this with more connection pools to the same instance.

For now, it's best to stick with v1.11.1 and we'll work to fix this and cut a patch release as soon as we're convinced we've got the problem solved.

@enocom enocom added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Jul 13, 2023
@enocom
Copy link
Member

enocom commented Jul 13, 2023

I've been running a little Spring Boot app that connects to two databases in one instance in my GKE cluster. My app uses two separate connection pools and has Hikari setup to refresh all connections every minute (with the thought that connection creation will trigger this bug). After half a day, I don't see the bad certificate yet. I'm going to downgrade this to a p2 to signal that this bug isn't as pervasive as we original thought. But we're still working on identifying the root cause.

@tomirio619
Copy link

tomirio619 commented Jul 17, 2023

Might be relevant information that I realize I have not shared, we only experience this issue on apps which are connected to multiple databases (postgres in this case). Any app only connected to one database works fine with the latest versions.

Interesting, in our case the affected app is also connecting to multiple different CloudSQL instances

hessjcg added a commit that referenced this issue Jul 18, 2023
To refresh a Cloud SQL Instance's certificates, the current algorithm uses 2 threads from the thread pool for each
instance. Because the thread pool is fixed size, if a user configures 2 or more instances, they were at risk of a 
causing thread starvation and a deadlock.

This increases the thread count to a safe level so that most users will never experience a deadlock.

Fixes #1314
@enocom
Copy link
Member

enocom commented Jul 18, 2023

Hey folks, we found an edge case where the connector when configured with multiple instances would flood the thread pool in such a way that the threads would be unable to progress. We've identified a few related improvements in #1391 and #1390. We'll cut a patch release this week for people affected by this.

hessjcg added a commit that referenced this issue Jul 20, 2023
Update the logic in forceRefresh() to reduce the churn on the thread pool when the certificate refresh API calls are failing.

New forceRefresh() logic ensures that:

Only 1 refresh cycle may run at a time.
If a refresh cycle is in progress, then it will not be canceled until it succeeds.
Add new test cases to validate race conditions, deadlocks, and concurrency.

Add additional logging to help diagnose production problems with certificate refresh.

Related to #1314

Fixes #1209
Fixes #1159
@enocom
Copy link
Member

enocom commented Jul 21, 2023

This has been fixed in v1.13.1. Thank you for all your help debugging this folks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
5 participants