Context cancellation ignored causing heavy CPU usage #370
Comments
Thanks for the report @jault3. If you're interested in submitting a PR, I'd be happy to work through it with you.
@enocom and I chatted about this a bit and decided that this section (cloud-sql-go-connector/internal/cloudsql/instance.go, lines 303 to 311 in 62a8c11) needs to be moved up ahead of the "immediate" reschedule (as of right now, it only exits on the "normal" reschedule path). We wanted to give you first crack at it @jault3 if you'd like. It would be great to have a test for this as well if possible.
Awesome! I'll get started on a fix with a test.
On our Cloud Run application we saw this error once in a while.
Perhaps related to this?
@liufuyang Cloud Run uses v1 of the proxy, so I think this is unrelated. If you have a support contract, I'd recommend opening a case and asking them to look into those errors - sounds like a network connectivity problem.
@kurtisvg Thank you. Just to be clear, we didn't use Cloud Run's proxy. We use a DB with an external IP and run the Go code in Cloud Run, using this repo's package to open a proxy to the DB with an IAM user. Before creating support tickets, we are also thinking about taking these actions to help solve the dial/refresh error:
Do you think these actions could help resolve this error on our side? And do you think it is possible to use an internal IP without any proxy but still use IAM user login? (I am not sure if there is an easy way to get the IAM user password if we don't use the proxy and connect to the DB directly.)
@liufuyang If you are using the go connector directly and not the I would need to peruse the code again to confirm, but context cancellation means that either you passed a context that timed out, or the default internal context for refreshes timed out.
Thanks a lot. We will try some mitigations on our side, such as connecting via private IP, to see if that helps with the problem. If needed we can write it up in a separate issue. Sorry for bringing some noise here.
Bug Description
I believe I found an issue with how the cloud-sql-go-connector instance refresh functionality handles errors, or more specifically, cancelled context errors.
I have a process that runs as a long-running web server and uses the cloud-sql-go-connector to connect to postgres Cloud SQL instances. I noticed extremely high CPU usage after running a few tests and then deleting the Cloud SQL instance that it was connecting to. After doing some profiling and looking through the cloud sql connector code, I think I have a decent understanding of the issue (please correct anything that is wrong).
The fetchMetadata function attempts to Get the instance details using the sqladmin library here. As expected, if the instance has been deleted, this returns an error here. Tracing that back a few more functions, I believe it is called from scheduleRefresh in this section of code. It appears that if the refresh returns any error, it immediately tries to schedule another one, but since the instance is deleted, it enters an endless refresh loop and attempts to hold on to the last known good connection. Should this be able to handle unrecoverable errors (like deleted instances) and return more gracefully? At first I thought this was the primary issue, but to me it raised a larger question: why was the connector still refreshing in the background at all, since I closed the dialer and connection (see example code below)?
The dialer looked to be cancelled properly here, which propagated to the instance close and would cancel the internal context.
However, this line returns an error if the context was cancelled. And this code always treats all errors the same (by scheduling another refresh).
I believe this needs to check if the returned error is context.Canceled, and if it is, just return. If this is accurate, I would love to submit a PR!
Example code (or command)
Stacktrace
n/a
How to reproduce
go run main.go (replacing the const values appropriately)
curl -XPOST localhost:8080 to trigger the db connection
Environment
macOS 13.0 (this same issue occurs in a linux container deployed to a GKE cluster)
go version go1.19.2 darwin/amd64
cloud.google.com/go/cloudsqlconn v1.0.1