
re-architect the client #91

Merged (59 commits, Oct 15, 2019)
Conversation

@arbll (Member) commented Jun 20, 2019

What does this PR do?

The main goal of this PR is to fix several behaviors in this client that could result in dropped metrics, increased resource consumption on the agent, and back pressure on the instrumented app. Since big architectural changes were needed to fix most of them, it made sense to fix them all at the same time.
Another goal is to make the client performant out of the box. This includes enabling buffering by default and setting sensible values for high throughput.

  • Makes buffering less bursty by sending buffers as soon as they fill up (previously all buffers were sent at the same time, after 100ms). Fixes Buffered client is inefficient #20.
  • Adds a way to configure the maximum size of a single payload (it was hardcoded to 1432, the optimal size for local UDP). Fixes No way to change OptimalPayloadSize #67.
  • Sets the default maximum payload size to 1432 bytes when using UDP and 8192 bytes when using UDS.
  • Moves the networking to a separate goroutine to avoid blocking the instrumented app and reduce latency (and removes the async UDS client, which was essentially the same thing but only for UDS).
  • Removes all allocations from the formatting logic to avoid putting pressure on the instrumented app's GC.

Architecture

[architecture diagram]

The high-level idea is:
Every time the instrumented app makes a call to the client, we format the metric and store it in the current buffer. When this buffer fills up (or a timeout is reached), it is forwarded to a queue of buffers that the sender writes over UDP or UDS as soon as possible. A new buffer is then taken from a buffer pool and used as the current buffer. When the sender is done with a queued buffer, it puts it back in the buffer pool.

There are two main goals with this design:

  • Avoid any dynamic allocations after client initialization by using free lists
  • Decouple the calls made by the instrumented app from the networking logic (this is particularly important with UDS, which blocks by default, but UDP also takes some time to send and has a few blocking edge cases)

Disclaimer: the architecture takes some inspiration from https://github.com/smira/go-statsd.

Upgrade notes

  • Sending a metric over UDS won't return an error if we fail to forward the datagram to the agent. We made this decision for two main reasons:
    • It made the UDS client blocking by default, which is not desirable
    • The design was flawed when buffering was enabled, as only the call that actually sent the buffer would return an error
  • The buffered option has been removed, as the client is now always buffered. If for some reason you need only one dogstatsd message per payload, you can still set the WithMaxMessagesPerPayload option to 1.
  • The asyncUDS option has been removed, as the networking layer now runs in a separate goroutine.

Benchmarks against master

This was not the main goal of this PR, but we get some nice improvements in raw performance:

Results with the default configuration:

benchmark                      old ns/op     new ns/op     delta
BenchmarkStatsdUDP1-8          1940          145           -92.53%
BenchmarkStatsdUDP10-8         2212          144           -93.49%
BenchmarkStatsdUDP100-8        2234          234           -89.53%
BenchmarkStatsdUDP1000-8       2397          1088          -54.61%
BenchmarkStatsdUDP10000-8      2488          1181          -52.53%
BenchmarkStatsdUDP100000-8     2446          1058          -56.75%
BenchmarkStatsdUDP200000-8     2530          986           -61.03%
BenchmarkStatsdUDS1-8          4811          131           -97.28%
BenchmarkStatsdUDS10-8         6078          141           -97.68%
BenchmarkStatsdUDS100-8        6544          234           -96.42%
BenchmarkStatsdUDS1000-8       7519          1210          -83.91%
BenchmarkStatsdUDS10000-8      7611          1183          -84.46%
BenchmarkStatsdUDS100000-8     7664          981           -87.20%
BenchmarkStatsdUDS200000-8     7854          1005          -87.20%

benchmark                      old allocs     new allocs     delta
BenchmarkStatsdUDP1-8          1              0              -100.00%
BenchmarkStatsdUDP10-8         1              0              -100.00%
BenchmarkStatsdUDP100-8        1              0              -100.00%
BenchmarkStatsdUDP1000-8       1              0              -100.00%
BenchmarkStatsdUDP10000-8      1              0              -100.00%
BenchmarkStatsdUDP100000-8     1              0              -100.00%
BenchmarkStatsdUDP200000-8     1              0              -100.00%
BenchmarkStatsdUDS1-8          9              0              -100.00%
BenchmarkStatsdUDS10-8         9              0              -100.00%
BenchmarkStatsdUDS100-8        9              0              -100.00%
BenchmarkStatsdUDS1000-8       9              0              -100.00%
BenchmarkStatsdUDS10000-8      9              0              -100.00%
BenchmarkStatsdUDS100000-8     9              0              -100.00%
BenchmarkStatsdUDS200000-8     10             0              -100.00%

benchmark                      old bytes     new bytes     delta
BenchmarkStatsdUDP1-8          32            0             -100.00%
BenchmarkStatsdUDP10-8         32            0             -100.00%
BenchmarkStatsdUDP100-8        32            0             -100.00%
BenchmarkStatsdUDP1000-8       32            0             -100.00%
BenchmarkStatsdUDP10000-8      37            0             -100.00%
BenchmarkStatsdUDP100000-8     44            2             -95.45%
BenchmarkStatsdUDP200000-8     53            6             -88.68%
BenchmarkStatsdUDS1-8          584           1             -99.83%
BenchmarkStatsdUDS10-8         584           1             -99.83%
BenchmarkStatsdUDS100-8        584           1             -99.83%
BenchmarkStatsdUDS1000-8       585           1             -99.83%
BenchmarkStatsdUDS10000-8      597           2             -99.66%
BenchmarkStatsdUDS100000-8     622           5             -99.20%
BenchmarkStatsdUDS200000-8     648           7             -98.92%

Results against the "optimal" configuration (MAX_INT elements in the buffer, AsyncUDS):

benchmark                      old ns/op     new ns/op     delta
BenchmarkStatsdUDP1-8          361           145           -59.83%
BenchmarkStatsdUDP10-8         398           144           -63.82%
BenchmarkStatsdUDP100-8        464           234           -49.57%
BenchmarkStatsdUDP1000-8       1395          1088          -22.01%
BenchmarkStatsdUDP10000-8      1573          1181          -24.92%
BenchmarkStatsdUDP100000-8     1609          1058          -34.24%
BenchmarkStatsdUDP200000-8     1612          986           -38.83%
BenchmarkStatsdUDS1-8          344           131           -61.92%
BenchmarkStatsdUDS10-8         385           141           -63.38%
BenchmarkStatsdUDS100-8        425           234           -44.94%
BenchmarkStatsdUDS1000-8       1477          1210          -18.08%
BenchmarkStatsdUDS10000-8      1537          1183          -23.03%
BenchmarkStatsdUDS100000-8     1275          981           -23.06%
BenchmarkStatsdUDS200000-8     1485          1005          -32.32%

benchmark                      old allocs     new allocs     delta
BenchmarkStatsdUDP1-8          1              0              -100.00%
BenchmarkStatsdUDP10-8         1              0              -100.00%
BenchmarkStatsdUDP100-8        1              0              -100.00%
BenchmarkStatsdUDP1000-8       1              0              -100.00%
BenchmarkStatsdUDP10000-8      1              0              -100.00%
BenchmarkStatsdUDP100000-8     1              0              -100.00%
BenchmarkStatsdUDP200000-8     1              0              -100.00%
BenchmarkStatsdUDS1-8          1              0              -100.00%
BenchmarkStatsdUDS10-8         1              0              -100.00%
BenchmarkStatsdUDS100-8        1              0              -100.00%
BenchmarkStatsdUDS1000-8       1              0              -100.00%
BenchmarkStatsdUDS10000-8      1              0              -100.00%
BenchmarkStatsdUDS100000-8     1              0              -100.00%
BenchmarkStatsdUDS200000-8     1              0              -100.00%

benchmark                      old bytes     new bytes     delta
BenchmarkStatsdUDP1-8          65            0             -100.00%
BenchmarkStatsdUDP10-8         65            0             -100.00%
BenchmarkStatsdUDP100-8        65            0             -100.00%
BenchmarkStatsdUDP1000-8       66            0             -100.00%
BenchmarkStatsdUDP10000-8      68            0             -100.00%
BenchmarkStatsdUDP100000-8     73            2             -97.26%
BenchmarkStatsdUDP200000-8     78            6             -92.31%
BenchmarkStatsdUDS1-8          78            1             -98.72%
BenchmarkStatsdUDS10-8         78            1             -98.72%
BenchmarkStatsdUDS100-8        78            1             -98.72%
BenchmarkStatsdUDS1000-8       79            1             -98.73%
BenchmarkStatsdUDS10000-8      80            2             -97.50%
BenchmarkStatsdUDS100000-8     86            5             -94.19%
BenchmarkStatsdUDS200000-8     85            7             -91.76%

arbll and others added 30 commits May 30, 2019 16:27
it was only benchmarking a small chunk of the code and there was some random stuff in there too
…

rely on the interface instead of the implementation.
I removed all the tests that were testing the implementation.
The main goal here is to detach the existing unit tests from the implementation
to be able to change the implementation and still test it with the old unit tests.

A few benchmarks that were not useful were also removed at the same time.
…ement)

setting the precision to 6 appears to be much more expensive, even when formatting floats with more than 6 digits after the decimal point

using 6 digits was introduced in #32
most other open-source projects seem to use -1

the only exception is timing, which will often have a high number of digits after the decimal
point. let's keep it at 6 for that
increase the perf by 10-20%
@KSerrania left a comment

Very nice PR 👍, the new architecture seems neat!

Made a first pass with some comments & nits, will take a look again later.

Resolved review threads on statsd/format.go, statsd/options.go, statsd/service_check.go, and statsd/statsd.go.
@truthbk changed the title from “re-architecture the client” to “re-architect the client” on Jul 19, 2019
@truthbk (Member) left a comment

Whoops, never submitted this out. Just some comments, but this looks very nice.

statsd/buffer.go:

func newStatsdBuffer(maxSize, maxElements int) *statsdBuffer {
	return &statsdBuffer{
		buffer: make([]byte, 0, maxSize*2),
Member:
comment here explaining the *2 would be nice, even if you have to talk a little bit about allocations and internal slice behavior.

}
}

func (b *statsdBuffer) reset() {
Member:

These are not thread-safe.... we'll have to synchronize access to the buffer (or maybe you've dealt with it cleverly somewhere else ;)

Member Author:

That was intentional; this struct is not thread-safe. It's the caller's responsibility to make sure it's not accessed concurrently.

Resolved review thread on statsd/buffer_pool.go.
}

func appendWithoutNewlines(buffer []byte, s string) []byte {
	// fast path for strings without newlines
Member:

Is it really a fast path? Won't it do something really close to what you're doing below anyway? (Just wondering if you'll be traversing the string twice unnecessarily.)

@arbll (Member Author) commented Aug 9, 2019:

This was ported from the old code. I was surprised as well, but IndexByte has an optimized assembly implementation that is indeed faster in benchmarks: https://golang.org/src/internal/bytealg/index_amd64.s

Optimizing for the no-newlines case sounds good to me, as this should be the normal case.

Resolved review thread on statsd/format.go.
statsd/statsd.go:
c.commands = c.commands[cmdsFlushed+1:]
// flush the current buffer. Lock must be held by the caller.
// The flushed buffer is written to the network asynchronously.
func (c *Client) flushLocked() {
Member:

I feel this name is confusing. I'd prefer something like flushUnsafe.

statsd/statsd.go:
return ""
}

func (c *Client) writeMetric(m metric) error {
Member:

Also unsafe, right? I feel like this should be noted.

Member Author:

Added an unsafe suffix.

statsd/statsd.go:
if b != '\n' {
	buf = append(buf, b)
}
if err := c.Flush(); err != nil {
Member:

Should an error while flushing prevent us from closing the sender?

Member Author:

Probably not. Let's do a best-effort flush and ignore the error?

	case b := <-p.pool:
		return b
	default:
		return newStatsdBuffer(p.bufferMaxSize, p.bufferMaxElements)
Member:

I've got mixed feelings about this… if we're using a pool, the expected behavior would be to pull the buffer from the pool; the pool allows setting a bound on the resources used. I'm wondering if we should add a flag to choose between strict pool behavior and dynamically growing the pool. Any thoughts?

Member Author:

I decided to go with this as it allows for some flexibility if you have a use case with sudden bursts of metrics. I see it as somewhat similar to sync.Pool, but with a maximum number of elements that should be kept and no cleanup on GC.

@truthbk (Member) left a comment

⚡️ nice!

Resolved review threads on statsd/event.go, statsd/service_check.go, and statsd/statsd.go.
@truthbk (Member) left a comment

Latest changes make sense to me! 👍

I think we're pretty much ready to go! Let's make sure all 3.x caveats (memory behavior, obsolete options, etc.) are properly documented, and feel free to pull the trigger on this as soon as you're ready.

@arbll (Member Author) commented Oct 14, 2019

@becoded Thanks for the review. I'm guessing the reason behind those changes is to add the NOOP client, right? This PR is already very large; let's move that to #92.

Successfully merging this pull request may close these issues: No way to change OptimalPayloadSize (#67); Buffered client is inefficient (#20).