
Context cancellation ignored causing heavy CPU usage #370

Closed
jault3 opened this issue Nov 3, 2022 · 8 comments · Fixed by #372 or #368
Labels
priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@jault3
Contributor

jault3 commented Nov 3, 2022

Bug Description

I believe I found an issue with how the cloud-sql-go-connector instance refresh functionality handles errors, or more specifically, cancelled context errors.

I have a process that runs as a long-running web server and uses the cloud-sql-go-connector to connect to postgres Cloud SQL instances. I noticed extremely high CPU usage after running a few tests and then deleting the Cloud SQL instance that it was connecting to. After doing some profiling and looking through the cloud sql connector code, I think I have a decent understanding of the issue (please correct anything that is wrong).

The fetchMetadata function attempts to Get the instance details using the sqladmin library here. As expected, if the instance has been deleted, that call returns an error here. Tracing that back a few more functions, I believe it is called from scheduleRefresh in this section of code. It appears that if the refresh returns any error, the connector immediately schedules another one; but since the instance is deleted, it enters an endless refresh loop while holding on to the last known good connection. Should this handle unrecoverable errors (like deleted instances) and return more gracefully? At first I thought this was the primary issue, but to me it raised a larger question: why was the connector still refreshing in the background at all, since I had closed the dialer and connection (see example code below)?

The dialer looked to be cancelled properly here, which propagated to the instance's close and cancelled the internal context.

However, this line returns an error if the context was cancelled, and this code treats all errors the same (by scheduling another refresh).

I believe this needs to check whether the returned error is context.Canceled, and if it is, simply return. If this is accurate, I would love to submit a PR!

Example code (or command)

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"time"

	"cloud.google.com/go/cloudsqlconn"
	"github.com/jackc/pgx/v4/pgxpool"
)

const (
	username = "postgres"
	password = "TODO"
	project  = "TODO"
	region   = "TODO" // e.g. us-central1
	instance = "TODO"
)

func main() {
	handler := func(w http.ResponseWriter, req *http.Request) {
		msg := "success"
		if err := selectOne(); err != nil {
			msg = err.Error()
		}
		w.Write([]byte(msg))
	}

	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func selectOne() error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*2)
	defer cancel()

	dsn := fmt.Sprintf("user=%s password=%s dbname=postgres sslmode=disable", username, password)
	config, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return fmt.Errorf("failed to parse pgx config: %v", err)
	}

	dialer, err := cloudsqlconn.NewDialer(ctx)
	if err != nil {
		return fmt.Errorf("failed to initialize dialer: %v", err)
	}
	defer dialer.Close()

	connectionName := fmt.Sprintf("%s:%s:%s", project, region, instance)

	config.ConnConfig.DialFunc = func(ctx context.Context, _ string, instance string) (net.Conn, error) {
		return dialer.Dial(ctx, connectionName)
	}

	conn, err := pgxpool.ConnectConfig(context.Background(), config)
	if err != nil {
		return fmt.Errorf("failed to connect: %w", err)
	}
	defer conn.Close()

	if _, err = conn.Exec(ctx, "select 1"); err != nil {
		return err
	}

	return nil
}

Stacktrace

n/a

How to reproduce

  1. Create a Cloud SQL postgres instance in the GCP console
  2. Run the above example code go run main.go (replacing the const values appropriately)
  3. Run curl -XPOST localhost:8080 to trigger the db connection
  4. Delete the Cloud SQL instance in the GCP console
  5. Wait some amount of time (until the ephemeral cert expires, which I believe takes 1 hour) and monitor the program's CPU usage

Environment

  1. OS type and version: macOS 13.0 (this same issue occurs in a linux container deployed to a GKE cluster)
  2. Go version: go version go1.19.2 darwin/amd64
  3. Connector version: cloud.google.com/go/cloudsqlconn v1.0.1
@jault3 jault3 added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Nov 3, 2022
@enocom enocom added the priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. label Nov 3, 2022
@enocom
Member

enocom commented Nov 3, 2022

Thanks for the report @jault3. If you're interested in submitting a PR, I'd be happy to work through it with you.

@kurtisvg
Contributor

kurtisvg commented Nov 3, 2022

@enocom and I chatted about this a bit and decided that this section:

select {
case <-i.ctx.Done():
// instance has been closed, don't schedule anything
return
default:
}
nextRefresh := i.cur.expiry.Add(-refreshBuffer)
i.next = i.scheduleRefresh(time.Until(nextRefresh))
})

needs to be moved up ahead of the "immediate" reschedule (as of right now it only exits the "normal" reschedule path). We wanted to give you first crack at it @jault3 if you'd like.

It would be great to have a test for this as well if possible.

@jault3
Contributor Author

jault3 commented Nov 3, 2022

Awesome! I'll get started on a fix with a test

@liufuyang

liufuyang commented Nov 5, 2022

On our Cloud Run application we saw this error once in a while.

rpc error: code = Internal desc = database read error: failed to connect to host=localhost user=xxx@xxx.iam database=mydb: dial error (timeout: refresh failed: context deadline exceeded)

Perhaps related to this?

@kurtisvg
Contributor

kurtisvg commented Nov 7, 2022

@liufuyang Cloud Run uses v1 of the proxy, so I think this is unrelated.

If you have a support contract, I'd recommend opening a case and asking them to look into those errors - sounds like a network connectivity problem.

@liufuyang

@kurtisvg Thank you. Just to be clear, we didn't use Cloud Run's built-in proxy; we have a database with an external IP and run the Go code on Cloud Run, using this repo's package to open a proxy connection to the database with an IAM user.

Before creating support tickets, we are also considering the following actions to help resolve the dial/refresh error:

  • move the Cloud Run application from us-west to us-east, as our db is in Europe
  • switch from an "external IP + proxy" setup to an "internal IP + proxy" setup (we use an IAM user, so we need the proxy to help with IAM user auth)

Do you think these actions could potentially help resolve this error on our side?

And do you think it is possible to use an internal IP without any proxy but still use IAM user login? (I am not sure there is an easy way to get the IAM user password if we don't use the proxy and connect to the DB directly.)

@kurtisvg
Contributor

kurtisvg commented Nov 7, 2022

@liufuyang If you are using the go connector directly and not the /cloudsql unix socket, you can open a new issue here with as much detail as possible to help us diagnose.

I would need to peruse the code again to confirm, but context cancellation means that either you passed a context that timed out, or the default internal context for refreshes timed out.

@liufuyang

Thanks a lot. We will try some mitigations on our side, such as connecting via private IP, to see if that helps. If needed, we can write it up in another issue. Sorry for bringing some noise here.
