
Context cancellation ignored causing heavy CPU usage #370

Closed
jault3 opened this issue Nov 3, 2022 · 8 comments · Fixed by #372 or #368
Labels
priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@jault3
Contributor

jault3 commented Nov 3, 2022

Bug Description

I believe I found an issue with how the cloud-sql-go-connector instance refresh functionality handles errors, or more specifically, cancelled context errors.

I have a process that runs as a long-running web server and uses the cloud-sql-go-connector to connect to postgres Cloud SQL instances. I noticed extremely high CPU usage after running a few tests and then deleting the Cloud SQL instance that it was connecting to. After doing some profiling and looking through the cloud sql connector code, I think I have a decent understanding of the issue (please correct anything that is wrong).

The fetchMetadata function attempts to Get the instance details using the sqladmin library here. As expected, if the instance has been deleted, that call returns an error here. Tracing that back a few more functions, I believe it is called from scheduleRefresh in this section of code. It appears that if the refresh returns any error, the connector immediately schedules another one; but since the instance is deleted, it enters an endless refresh loop while holding on to the last known good connection. Should this handle unrecoverable errors (like deleted instances) and return more gracefully? At first I thought this was the primary issue, but to me it raised a larger question: why was the connector still refreshing in the background at all, since I had closed the dialer and connection (see example code below)?

The dialer looked to be cancelled properly here, which propagated to the instance's close and cancelled the internal context.

However, this line returns an error if the context was cancelled, and this code treats all errors the same (by scheduling another refresh).

I believe this needs to check whether the returned error is context.Canceled, and if it is, simply return. If this is accurate, I would love to submit a PR!

Example code (or command)

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"time"

	"cloud.google.com/go/cloudsqlconn"
	"github.com/jackc/pgx/v4/pgxpool"
)

const (
	username = "postgres"
	password = "TODO"
	project  = "TODO"
	region   = "TODO" // e.g. us-central1
	instance = "TODO"
)

func main() {
	handler := func(w http.ResponseWriter, req *http.Request) {
		msg := "success"
		if err := selectOne(); err != nil {
			msg = err.Error()
		}
		w.Write([]byte(msg))
	}

	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func selectOne() error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*2)
	defer cancel()

	dsn := fmt.Sprintf("user=%s password=%s dbname=postgres sslmode=disable", username, password)
	config, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return fmt.Errorf("failed to parse pgx config: %v", err)
	}

	dialer, err := cloudsqlconn.NewDialer(ctx)
	if err != nil {
		return fmt.Errorf("failed to initialize dialer: %v", err)
	}
	defer dialer.Close()

	connectionName := fmt.Sprintf("%s:%s:%s", project, region, instance)

	config.ConnConfig.DialFunc = func(ctx context.Context, _ string, instance string) (net.Conn, error) {
		return dialer.Dial(ctx, connectionName)
	}

	conn, err := pgxpool.ConnectConfig(context.Background(), config)
	if err != nil {
		return fmt.Errorf("failed to connect: %w", err)
	}
	defer conn.Close()

	if _, err = conn.Exec(ctx, "select 1"); err != nil {
		return err
	}

	return nil
}

Stacktrace

n/a

How to reproduce

  1. Create a Cloud SQL postgres instance in the GCP console
  2. Run the above example code go run main.go (replacing the const values appropriately)
  3. Run curl -XPOST localhost:8080 to trigger the db connection
  4. Delete the Cloud SQL instance in the GCP console
  5. Wait some amount of time (until the ephemeral cert expires, which I believe takes 1 hour) and monitor the program's CPU usage

Environment

  1. OS type and version: macOS 13.0 (this same issue occurs in a linux container deployed to a GKE cluster)
  2. Go version: go version go1.19.2 darwin/amd64
  3. Connector version: cloud.google.com/go/cloudsqlconn v1.0.1
@jault3 jault3 added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Nov 3, 2022
@enocom enocom added the priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. label Nov 3, 2022
@enocom
Member

enocom commented Nov 3, 2022

Thanks for the report @jault3. If you're interested in submitting a PR, I'd be happy to work through it with you.

@kurtisvg
Contributor

kurtisvg commented Nov 3, 2022

@enocom and I chatted about this a bit and decided that this section:

select {
case <-i.ctx.Done():
// instance has been closed, don't schedule anything
return
default:
}
nextRefresh := i.cur.expiry.Add(-refreshBuffer)
i.next = i.scheduleRefresh(time.Until(nextRefresh))
})

needs to be moved up ahead of the "immediate" reschedule (as of right now it only exits the "normal" reschedule path). We wanted to give you first crack at it @jault3 if you'd like.

It would be great to have a test for this as well if possible.

@jault3
Contributor Author

jault3 commented Nov 3, 2022

Awesome! I'll get started on a fix with a test

@liufuyang

liufuyang commented Nov 5, 2022

On our Cloud Run application we saw this error once in a while.

rpc error: code = Internal desc = database read error: failed to connect to host=localhost user=xxx@xxx.iam database=mydb: dial error (timeout: refresh failed: context deadline exceeded)

Perhaps related to this?

@kurtisvg
Contributor

kurtisvg commented Nov 7, 2022

@liufuyang Cloud Run uses v1 of the proxy, so I think this is unrelated.

If you have a support contract, I'd recommend opening a case and asking them to look into those errors - sounds like a network connectivity problem.

@liufuyang

@kurtisvg Thank you. Just to be clear, we didn't use Cloud Run's built-in proxy; we have a database with an external IP and run the Go code on Cloud Run, using this repo's package to open a proxy connection to the database with an IAM user.

Before creating support tickets, we are also considering the following actions to help resolve the dial/refresh error:

  • move the Cloud Run application from us-west to us-east, as our db is in Europe
  • switch from an "external IP + proxy" setup to an "internal IP + proxy" setup (we use an IAM user, so we need the proxy to help with IAM user auth)

Do you think these actions could potentially help resolve this error on our side?

And do you think it is possible to use an internal IP without any proxy but still use IAM user login? (I am not sure there is an easy way to get the IAM user password if we don't use the proxy and connect to the DB directly.)

@kurtisvg
Contributor

kurtisvg commented Nov 7, 2022

@liufuyang If you are using the go connector directly and not the /cloudsql unix socket, you can open a new issue here with as much detail as possible to help us diagnose.

I would need to peruse the code again to confirm, but context cancellation means that either you passed a context that timed out, or the default internal context for refreshes timed out.

@liufuyang

Thanks a lot. We will try some mitigations on our side, such as connecting via private IP, to see if that helps. If needed, we can write it up in another issue. Sorry for bringing some noise here.
