Fix encoder pool issues #116

Merged · merged 3 commits into master from gabin/fix-buffer · Dec 12, 2017

Conversation

@gabsn gabsn commented Oct 2, 2017

After several race conditions and unexpected panics caused by reusing the underlying buffers via a pool of encoders, we need to make the following changes:

  • allocate a new encoder (and therefore a new bytes.Buffer) each time we flush. This adds some performance overhead, but it is necessary since neither the encoder nor the bytes.Buffer is thread-safe (a minimal sketch of this approach follows the note below).
  • remove the encoder pool, which was not used anyway since the worker works synchronously (flush traces -> encode traces -> send the encoder -> return the encoder -> new cycle for the worker).

Note: Since we only flush every 2s, allocating a new encoder at every flush is acceptable.
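For illustration, here is a minimal sketch of the per-flush allocation strategy described above. All names (encoder, newEncoder, encodeTraces, flush) are hypothetical stand-ins, not the tracer's actual API:

```go
// Minimal sketch (hypothetical names, not the real dd-trace-go API):
// every flush builds a brand-new encoder around a brand-new bytes.Buffer,
// so no later flush can reset or grow a buffer that is still being sent.
package main

import (
	"bytes"
	"fmt"
)

type encoder struct {
	buf *bytes.Buffer
}

// newEncoder allocates a fresh encoder per flush; the previous one is simply
// garbage-collected once the HTTP send has finished reading its buffer.
func newEncoder() *encoder {
	return &encoder{buf: &bytes.Buffer{}}
}

func (e *encoder) encodeTraces(traces []string) []byte {
	for _, t := range traces {
		e.buf.WriteString(t) // stand-in for the real msgpack/JSON encoding
	}
	return e.buf.Bytes()
}

func flush(traces []string) []byte {
	enc := newEncoder() // no pool, no sharing, no reuse
	return enc.encodeTraces(traces)
}

func main() {
	fmt.Printf("%s\n", flush([]string{"span-a", "span-b"}))
}
```

The trade-off is exactly the one stated above: one extra allocation roughly every 2 seconds in exchange for never handing a reusable buffer to the transport.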

TODO

This PR needs to be merged to prevent apps that use our Go tracer from panicking.
However, we should keep working on improving the tracer's efficiency.
Here are some ideas:

  • make the worker asynchronous, i.e. send traces/services from a new goroutine (but make sure we don't introduce new race conditions)
  • encode the traces/services into a thread-safe struct that can be sent over the network, so that the encoders can be reused and we avoid allocating a new encoder every time we flush, again without introducing any race condition (a rough sketch of this idea follows the note below)

Note: We should also work on a way to precisely measure the performance impact of our tracer on real apps in production (I tried to use the out-of-the-box expvar metrics, but the results were not conclusive).
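To make the second idea above concrete, here is a rough sketch; none of these names (payload, reusableEncoder, encode) are the tracer's real API, they only illustrate the shape of the idea. The encoder is reused under a mutex, but each flush copies the encoded bytes into an immutable payload that a sender goroutine can read safely:

```go
// Rough sketch of the "thread-safe payload" idea (hypothetical names):
// the encoder and its bytes.Buffer are reused, but every flush copies the
// encoded bytes into a fresh payload, so resetting the buffer later cannot
// corrupt what a sender goroutine is still reading.
package main

import (
	"bytes"
	"fmt"
	"sync"
)

type payload struct {
	contentType string
	body        []byte // private copy, never mutated after creation
}

type reusableEncoder struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (e *reusableEncoder) encode(traces []string) *payload {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.buf.Reset()
	for _, t := range traces {
		e.buf.WriteString(t) // stand-in for the real msgpack encoding
	}
	// Copy out of the reusable buffer so the next Reset cannot invalidate
	// the bytes handed to the sender.
	body := make([]byte, e.buf.Len())
	copy(body, e.buf.Bytes())
	return &payload{contentType: "application/msgpack", body: body}
}

func main() {
	enc := &reusableEncoder{}
	p := enc.encode([]string{"span-a", "span-b"})

	done := make(chan struct{})
	go func() { // the sender goroutine only ever sees its own immutable copy
		defer close(done)
		fmt.Printf("sending %d bytes as %s\n", len(p.body), p.contentType)
	}()
	<-done
}
```

The copy costs one allocation per flush, but it decouples the encoder's lifetime from the network send, which is what makes the goroutine safe.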

Issues this PR solves

Unexpected panic

panic: runtime error: slice bounds out of range

goroutine 18 [running]:
bytes.(*Buffer).Bytes(...)
        /home/vagrant/.gimme/versions/go1.9.linux.amd64/src/bytes/buffer.go:52
github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.(*httpTransport).SendTraces(0xc4201380c0, 0xc438f4fb00, 0x8, 0x8, 0x0, 0x0, 0x0)
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/transport.go:117 +0x80d
github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.(*Tracer).flushTraces(0xc4201341a0)
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/tracer.go:286 +0x33d
github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.(*Tracer).flush(0xc4201341a0)
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/tracer.go:329 +0x2b
github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.(*Tracer).worker(0xc4201341a0)
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/tracer.go:352 +0x321
created by github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.NewTracerTransport
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/tracer.go:87 +0x3d6

Race condition (already solved by #112)

==================
WARNING: DATA RACE
Write at 0x00c42005a300 by goroutine 7:
  bytes.(*Buffer).Truncate()
      /home/vagrant/.gimme/versions/go1.8.linux.amd64/src/bytes/buffer.go:71 +0x40
  bytes.(*Buffer).Reset()
      /home/vagrant/.gimme/versions/go1.8.linux.amd64/src/bytes/buffer.go:85 +0x41
  github.com/DataDog/dd-trace-go/tracer.(*msgpackEncoder).EncodeTraces()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/encoder.go:42 +0x50
  github.com/DataDog/dd-trace-go/tracer.(*httpTransport).SendTraces()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/transport.go:103 +0x21d
  github.com/DataDog/dd-trace-go/tracer.(*Tracer).flushTraces()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:275 +0x457
  github.com/DataDog/dd-trace-go/tracer.(*Tracer).flush()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:318 +0x38
  github.com/DataDog/dd-trace-go/tracer.(*Tracer).worker()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:346 +0x217

Previous write at 0x00c42005a300 by goroutine 276:
  [failed to restore the stack]

Goroutine 7 (running) created at:
  github.com/DataDog/dd-trace-go/tracer.NewTracerTransport()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:85 +0x5b5
  github.com/DataDog/dd-trace-go/tracer.NewTracer()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:61 +0x46
  github.com/DataDog/dd-trace-go/tracer.init()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:374 +0x108
  main.init()
      /home/vagrant/go/src/github.com/DataDog/dd-go/apps/hms-resolver/stats.go:114 +0xd1

Goroutine 276 (finished) created at:
  net/http.(*Transport).dialConn()
      /home/vagrant/.gimme/versions/go1.8.linux.amd64/src/net/http/transport.go:1118 +0xc2c
  net/http.(*Transport).getConn.func4()
      /home/vagrant/.gimme/versions/go1.8.linux.amd64/src/net/http/transport.go:908 +0xa2
==================

@gabsn gabsn requested a review from palazzem October 2, 2017 23:07
@gabsn gabsn force-pushed the gabin/fix-buffer branch 3 times, most recently from 9cb2eca to 3efa359 on October 4, 2017 at 21:30
@palazzem palazzem added this to the 0.5.2 milestone Nov 27, 2017
@palazzem palazzem added the core and enhancement (quick change/addition that does not need full team approval) labels Nov 27, 2017
@palazzem palazzem requested a review from ufoot November 27, 2017 14:58
@ufoot ufoot (Member) left a comment

Changes requested on coding style, but nothing big. This only needs to be battle-tested under heavy load.

JSON_ENCODER = iota
MSGPACK_ENCODER
jsonType = iota
msgpackType
Member:

I don't really see the point in having an int to describe the encoder type here. Just use the encoding as the key. Let the constants be jsonType = "application/json" and msgpackType = "application/msgpack", and use those directly. Using an int does not actually restrict us to a finite set of values, so we always run the risk of passing -1 or 2, going out of bounds, and returning nil when calling contentType(int). So let's just use the encoding as the key, while still using the constants jsonType and msgpackType to avoid copy/paste and typo errors.
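A small sketch of this suggestion follows; it is not the actual dd-trace-go code, just an illustration where the constants carry the content type directly, so there is no int-to-string lookup and no out-of-range value to guard against:

```go
// Hypothetical sketch of the suggestion: the encoder-type constants are the
// content-type strings themselves, so no contentType(int) lookup is needed
// and there is no out-of-bounds int to worry about.
package main

import "fmt"

const (
	jsonType    = "application/json"
	msgpackType = "application/msgpack"
)

func newEncoderFor(encoderType string) string {
	switch encoderType {
	case jsonType:
		return "json encoder"
	case msgpackType:
		return "msgpack encoder"
	default:
		// Callers are expected to pass one of the named constants above.
		return ""
	}
}

func main() {
	// The constant doubles as the Content-Type header value.
	fmt.Println(newEncoderFor(msgpackType))
}
```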

pool := &encoderPool{
encoderType: encoderType,
pool: make(chan Encoder, size),
func contentType(encoderType int) string {
Member:

I think this function can be removed; just use encoderType as the key.

// can theoretically read the underlying buffer even after the encoder has been returned to the pool.
// Since the underlying bytes.Buffer is not thread-safe, this can make the app panic,
// since this method will later spawn a goroutine referencing this buffer.
// That's why we prefer the less performant yet SAFE implementation of allocating a new encoder every time we flush.
Member:

One way to get the encoder overhead would be to just write a unit test that... only allocates an encoder. Totally agree that if this only happens on the order of every 2 seconds, we'd be fine. Our current implementation tends to flush big payloads rather than make many small calls, which, from a high-level view, is fine.
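A hedged sketch of such a measurement, written as a Go benchmark; benchEncoder is a stand-in for whatever encoder type the tracer actually constructs:

```go
// encoder_bench_test.go: illustrative only; benchEncoder stands in for the
// tracer's real encoder type.
package tracer

import (
	"bytes"
	"testing"
)

type benchEncoder struct {
	buf *bytes.Buffer
}

func newBenchEncoder() *benchEncoder {
	return &benchEncoder{buf: &bytes.Buffer{}}
}

// BenchmarkEncoderAllocation measures only the cost the per-flush strategy
// adds: allocating a fresh encoder, which in practice happens about every 2s.
func BenchmarkEncoderAllocation(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = newBenchEncoder()
	}
}
```

Running it with `go test -bench=EncoderAllocation -benchmem` would report ns/op and allocs/op for the allocation alone.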

Contributor:

I totally agree that we're good here. This isn't a hot path: even with a million requests, we still only allocate one encoder every 2 seconds. Good to keep this comment to explain why we undo that (premature) optimization.
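For readers outside the review thread, here is a contrived, self-contained illustration of the hazard those code comments describe (deliberately racy, and not the tracer's code): the slice returned by bytes.Buffer.Bytes() aliases the buffer's storage, so resetting and reusing a pooled buffer while a sender goroutine still holds that slice is a data race.

```go
// Deliberately racy illustration (run with `go run -race` to see the report):
// the slice from buf.Bytes() aliases the buffer, so a later Reset/Write on a
// reused, pooled buffer mutates bytes a slow sender may still be reading.
package main

import (
	"bytes"
	"fmt"
	"time"
)

func main() {
	shared := &bytes.Buffer{} // imagine this buffer came from an encoder pool

	shared.WriteString("payload-1")
	view := shared.Bytes() // the "sent" bytes still alias shared's storage

	go func() {
		time.Sleep(10 * time.Millisecond) // pretend this is a slow HTTP send
		fmt.Printf("sender sees: %q\n", view)
	}()

	shared.Reset()                  // encoder "returned to the pool" and reused...
	shared.WriteString("payload-2") // ...corrupting what the sender will read

	time.Sleep(50 * time.Millisecond)
}
```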

@palazzem palazzem modified the milestones: 0.5.2, 0.6.1 Dec 6, 2017
@bmermet bmermet force-pushed the gabin/fix-buffer branch 2 times, most recently from 9fcabc6 to 585deea on December 6, 2017 at 09:24
@bmermet (Contributor) commented Dec 7, 2017

@ufoot and @palazzem I have checked locally that this PR solves the race condition in serialization. I think it's ready to be merged.
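Presumably a check like this amounts to running the tests under Go's race detector, for example (package path illustrative):

```
go test -race ./tracer/...
```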

@ufoot ufoot (Member) left a comment

Thanks for the update, GTM.

@palazzem palazzem merged commit 21bd9f7 into master Dec 12, 2017
@palazzem palazzem deleted the gabin/fix-buffer branch December 12, 2017 10:32
jdgordon pushed a commit to jdgordon/dd-trace-go that referenced this pull request May 31, 2022