
pgwire: remove readTimeoutConn in favor of a channel #124373

Merged (2 commits) on May 20, 2024

Conversation

@rafiss (Collaborator) commented May 17, 2024:

Rather than using a connection wrapper that polls for the context being done every second, we now spin up an additional goroutine that blocks until the connection context is done or the drain signal is received.
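In outline, the new approach is something like the following (a minimal sketch, assuming hypothetical names watchConn and drainCh; it is not the actual CockroachDB code):

package sketch

import (
	"context"
	"net"
	"time"
)

// watchConn parks until the connection context is canceled or a drain is
// requested, then forces the reader goroutine's pending conn.Read to
// return so its loop can exit. No periodic polling is involved: the
// goroutine is never scheduled while it waits.
func watchConn(ctx context.Context, conn net.Conn, drainCh <-chan struct{}) {
	select {
	case <-ctx.Done():
		// Connection context canceled (e.g. server shutdown).
	case <-drainCh:
		// The server asked this connection to drain.
	}
	// One way to unblock the reader: a read deadline in the past makes a
	// blocked Read return a timeout error immediately. (The PR prefers
	// CloseRead for TCP connections, with a fallback; see the review
	// discussion below.)
	_ = conn.SetReadDeadline(time.Now())
}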

I wrote a simple benchmark of the idle connection loop to generate CPU profiles.

With the old readTimeoutConn:
[CPU profile screenshot]

With the new goroutine approach:
[CPU profile screenshot]

There's definitely less noise and overhead.
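The benchmark itself is not shown in this thread; a standalone sketch of the idea (hypothetical, with a trivial in-process TCP server standing in for the pgwire server) might look like:

package sketch

import (
	"net"
	"testing"
	"time"
)

// BenchmarkIdleConns opens a batch of idle connections and then sits
// still. The b.N loop measures nothing meaningful on its own; it exists
// only so a CPU profile (go test -bench IdleConns -cpuprofile cpu.out)
// can capture the per-connection idle overhead.
func BenchmarkIdleConns(b *testing.B) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		b.Fatal(err)
	}
	defer ln.Close()
	var serverConns []net.Conn
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			serverConns = append(serverConns, c) // hold open, never read
		}
	}()
	for i := 0; i < 100; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			b.Fatal(err)
		}
		defer c.Close()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		time.Sleep(time.Millisecond) // stay idle while the profiler samples
	}
}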

fixes #25585
Release note: None

@rafiss rafiss requested a review from yuzefovich May 17, 2024 23:01
@rafiss rafiss requested review from a team as code owners May 17, 2024 23:01
@cockroach-teamcity (Member) commented:

This change is Reviewable

@yuzefovich (Member) left a comment:

Thanks for looking into this! The change overall seems good to me.

I'm not too familiar with this code, so I want to double check my understanding. IIUC currently for each connection we have two goroutines:

  • the "reader" goroutine that continuously loops in serveImpl and sees what the client is requesting us to do
  • and the "processor" goroutine for the connExecutor that executes commands added into stmtBuf by the "reader" goroutine.

This PR introduces a third "watchdog" goroutine that monitors the state for cancellation and draining. In other words, we make a trade-off between 1) periodically forcing net.Conn to wake up in the "reader" due to a read deadline to see whether the "reader" should quit and 2) a separate goroutine that only monitors things with no read deadline set for the "reader". In practice, this "watchdog" goroutine should be idle most of the time, so it should remove some system calls. Does this sound right?
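For contrast, the pattern being removed looked roughly like this (a sketch of the shape of readTimeoutConn, not the exact code):

package sketch

import (
	"net"
	"time"
)

// readTimeoutConn wraps a net.Conn so that every Read polls: it sets a
// one-second deadline and, on timeout, checks the exit conditions (e.g.
// whether the context is done) before trying again. An idle connection
// therefore wakes up and issues syscalls roughly once per second.
type readTimeoutConn struct {
	net.Conn
	checkExitConds func() error
}

func (c *readTimeoutConn) Read(b []byte) (int, error) {
	for {
		if err := c.checkExitConds(); err != nil {
			return 0, err
		}
		if err := c.Conn.SetReadDeadline(time.Now().Add(time.Second)); err != nil {
			return 0, err
		}
		n, err := c.Conn.Read(b)
		if nErr, ok := err.(net.Error); ok && nErr.Timeout() && n == 0 {
			continue // deadline expired with no data; poll again
		}
		return n, err
	}
}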

@@ -945,6 +942,35 @@ func (s *Server) serveImpl(
authOpt authOptions,
sessionID clusterunique.ID,
) {
yuzefovich (Member):

nit: the last part of the comment for serveImpl needs an update.

rafiss (Collaborator, Author):

done

tcpConn, _ = underConn.(*net.TCPConn)
}
if tcpConn == nil {
_ = c.conn.Close()
yuzefovich (Member):

We didn't close the connection in both directions before this patch. Based on the comment on serveImpl, it seems that it's the connExecutor's job to do so. Why add this here?

rafiss (Collaborator, Author):

hm good point. it seems safer to only close the read side.

this matches the previous behavior, where if the context was cancelled, then the readTimeoutConn would start returning errors when Read is called:

typ, n, err := c.readBuf.ReadTypedMsg(&c.rd)

that should cause the main loop of the reader goroutine to exit here:

// If we can't read data because of any one of the following conditions,
// then we should break:
// 1. the connection was closed.
// 2. the context was canceled (e.g. during authentication).
// 3. we reached an arbitrary threshold of repeated errors.
if netutil.IsClosedConnection(err) ||
errors.Is(err, context.Canceled) ||
repeatedErrorCount > maxRepeatedErrorCount {
break
}
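For reference, closing only the read half can be sketched as follows (hypothetical helper; unblockReader is not a name from the PR):

// unblockReader forces the reader goroutine's pending Read to return.
// Closing only the read half of a TCP connection leaves the write half
// open, so any in-flight response can still be flushed to the client.
func unblockReader(conn net.Conn) {
	if tcpConn, ok := conn.(*net.TCPConn); ok {
		_ = tcpConn.CloseRead()
	}
	// Non-TCP connections need a separate fallback; see the follow-up
	// comments below.
}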

@rafiss (Collaborator, Author) commented May 20, 2024:

although, thinking about it more, the risk is that if the type switch no longer works as expected (for example, if there's a new implementation of net.Conn to deal with later), then not having a default case would make the connection hang forever.

maybe it's safer to keep the Close as a fallback? WDYT?

// DrainRequest. This will make the processor quit whenever it finds a
// good time.
_ /* err */ = c.stmtBuf.Push(ctx, sql.DrainRequest{})
<-ctx.Done()
yuzefovich (Member):

nit: a comment (that eventually the drain request will be processed by the server and all connections will be closed, which is achieved by cancelling the context) would be helpful.

rafiss (Collaborator, Author):

done

@rafiss (Collaborator, Author) left a comment:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This PR introduces a third "watchdog" goroutine that monitors the state for cancellation and draining. In other words, we make a trade-off between 1) periodically forcing net.Conn to wake up in the "reader" due to a read deadline to see whether the "reader" should quit and 2) a separate goroutine that only monitors things with no read deadline set for the "reader". In practice, this "watchdog" goroutine should be idle most of the time, so it should remove some system calls. Does this sound right?

yes, sounds right. it's worth noting that a goroutine that does nothing apart from waiting on a channel is very efficient in Go. the scheduler should never try to run it.
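As a rough standalone illustration of that point (hypothetical, not from the PR):

package main

import (
	"fmt"
	"runtime"
	"time"
)

// Spawn many goroutines that immediately park on a channel receive. The
// runtime never schedules a parked goroutine; the only cost is its stack
// (a few KB each).
func main() {
	ch := make(chan struct{})
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < 100000; i++ {
		go func() { <-ch }()
	}
	time.Sleep(100 * time.Millisecond) // let them all park
	runtime.ReadMemStats(&after)
	fmt.Printf("goroutines: %d, extra stack: %d MiB\n",
		runtime.NumGoroutine(), (after.StackSys-before.StackSys)>>20)
	close(ch) // release them
}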

@yuzefovich (Member) left a comment:

Reviewed 4 of 4 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/pgwire/server.go line 968 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

although, thinking about it more, the risk is that if the type switch no longer works as expected (like for example, there's a new implementation of net.Conn to deal with later), then not having a default case would make the connection hang forever.

maybe it's safer to keep the Close as a fallback? WDYT?

Sounds reasonable to me.

I traced where Close is called currently, and I think it's a few layers up in the defer in the async task in netutil.TCPServer.ServeWith, so the comment (that "serveImpl always closes the network connection before returning") is misleading before your patch but will actually become closer to reality after the patch. It might be worth adjusting the comment.

@rafiss rafiss requested a review from yuzefovich May 20, 2024 18:14
@yuzefovich (Member) left a comment:

Looks like the updates haven't been pushed yet?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)

@rafiss (Collaborator, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/pgwire/server.go line 968 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Sounds reasonable to me.

I traced where Close is called currently, and I think it's a few layers up in the defer in in the async task in netutil.TCPServer.ServeWith, so the comment (that "serveImpl always closes the network connection before returning") is misleading before your patch but will actually become closer to reality after the patch. It might be worth adjusting the comment.

I was about to push the change, then realized that another fallback option is to call conn.SetReadDeadline(now()), which might be less disruptive. I'll do that instead.
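Sketched out, that fallback looks something like this (illustrative, not the exact code from the PR; underConn is the unwrapped connection from the diff above):

switch conn := underConn.(type) {
case *net.TCPConn:
	// Preferred: close only the read half so writes can still flush.
	_ = conn.CloseRead()
default:
	// Fallback for any other net.Conn implementation: a read deadline in
	// the past makes the pending Read return a timeout error immediately.
	// The watchdog fires only once, so this does not reintroduce the old
	// once-per-second polling.
	_ = underConn.SetReadDeadline(time.Now())
}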

@yuzefovich (Member) left a comment:

:lgtm:

Reviewed 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/pgwire/server.go line 968 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

I was about to push the change, then realized that another fallback option is to call conn.SetReadDeadline(now()), which might be less disruptive. I'll do that instead.

Nice, I like it.

Rather than using a connection that polls for the context being done
every second, we now spin up an additional goroutine that blocks until
the connection context is done or the drain signal is received.

Release note: None

This benchmark was only useful for capturing a CPU profile; it doesn't
measure anything else.

Release note: None
@rafiss (Collaborator, Author) commented May 20, 2024:

tftr!

bors r=yuzefovich

craig bot pushed a commit that referenced this pull request May 20, 2024
124373: pgwire: remove readTimeoutConn in favor of a channel r=yuzefovich a=rafiss

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@craig bot (Contributor) commented May 20, 2024:

Build failed:

@rickystewart (Collaborator) commented:

bors r=yuzefovich

Successfully merging this pull request may close these issues.

pgwire: replace deadline and connection polling with separate goroutine