Fix encoder pool issues #116

Merged · merged 3 commits into master from gabin/fix-buffer · Dec 12, 2017

Conversation

@gabsn gabsn commented Oct 2, 2017

After several race conditions and unexpected panics caused by reusing the underlying buffers via a pool of encoders, we need to make the following changes:

  • allocate a new encoder (and therefore a new bytes.Buffer) each time we flush. This adds some performance overhead, but it is necessary since neither the encoder nor the bytes.Buffer is thread-safe (a minimal sketch of this approach follows the note below).
  • remove the encoder pool, which was not used anyway since the worker works synchronously (flush traces -> encode traces -> send the encoder -> return the encoder -> new cycle for the worker).

Note: Since we only flush every 2s, allocating a new encoder at every flush is acceptable.
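For illustration, here is a minimal sketch of the per-flush allocation strategy described above. All names (encoder, newEncoder, encodeTraces, flush) are hypothetical stand-ins, not the tracer's actual API:

```go
// Minimal sketch (hypothetical names, not the real dd-trace-go API):
// every flush builds a brand-new encoder around a brand-new bytes.Buffer,
// so no later flush can reset or grow a buffer that is still being sent.
package main

import (
	"bytes"
	"fmt"
)

type encoder struct {
	buf *bytes.Buffer
}

// newEncoder allocates a fresh encoder per flush; the previous one is simply
// garbage-collected once the HTTP send has finished reading its buffer.
func newEncoder() *encoder {
	return &encoder{buf: &bytes.Buffer{}}
}

func (e *encoder) encodeTraces(traces []string) []byte {
	for _, t := range traces {
		e.buf.WriteString(t) // stand-in for the real msgpack/JSON encoding
	}
	return e.buf.Bytes()
}

func flush(traces []string) []byte {
	enc := newEncoder() // no pool, no sharing, no reuse
	return enc.encodeTraces(traces)
}

func main() {
	fmt.Printf("%s\n", flush([]string{"span-a", "span-b"}))
}
```

The trade-off is exactly the one stated above: one extra allocation roughly every 2 seconds in exchange for never handing a reusable buffer to the transport.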

TODO

This PR needs to be merged to prevent apps that use our Go tracer from panicking.
However, we should keep working on improving the tracer's efficiency.
Here are some ideas:

  • make the worker asynchronous, i.e. send traces/services from a new goroutine (but make sure we don't introduce new race conditions)
  • encode the traces/services into a thread-safe struct that can be sent over the network, so that the encoders can be reused and we avoid allocating a new encoder every time we flush, again without introducing any race condition (a rough sketch of this idea follows the note below)

Note: We should also work on a way to precisely measure the performance impact of our tracer on real apps in production (I tried to use the out-of-the-box expvar metrics, but the results were not conclusive).
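To make the second idea above concrete, here is a rough sketch; none of these names (payload, reusableEncoder, encode) are the tracer's real API, they only illustrate the shape of the idea. The encoder is reused under a mutex, but each flush copies the encoded bytes into an immutable payload that a sender goroutine can read safely:

```go
// Rough sketch of the "thread-safe payload" idea (hypothetical names):
// the encoder and its bytes.Buffer are reused, but every flush copies the
// encoded bytes into a fresh payload, so resetting the buffer later cannot
// corrupt what a sender goroutine is still reading.
package main

import (
	"bytes"
	"fmt"
	"sync"
)

type payload struct {
	contentType string
	body        []byte // private copy, never mutated after creation
}

type reusableEncoder struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (e *reusableEncoder) encode(traces []string) *payload {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.buf.Reset()
	for _, t := range traces {
		e.buf.WriteString(t) // stand-in for the real msgpack encoding
	}
	// Copy out of the reusable buffer so the next Reset cannot invalidate
	// the bytes handed to the sender.
	body := make([]byte, e.buf.Len())
	copy(body, e.buf.Bytes())
	return &payload{contentType: "application/msgpack", body: body}
}

func main() {
	enc := &reusableEncoder{}
	p := enc.encode([]string{"span-a", "span-b"})

	done := make(chan struct{})
	go func() { // the sender goroutine only ever sees its own immutable copy
		defer close(done)
		fmt.Printf("sending %d bytes as %s\n", len(p.body), p.contentType)
	}()
	<-done
}
```

The copy costs one allocation per flush, but it decouples the encoder's lifetime from the network send, which is what makes the goroutine safe.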

Issues this PR solves

Unexpected panic

panic: runtime error: slice bounds out of range

goroutine 18 [running]:
bytes.(*Buffer).Bytes(...)
        /home/vagrant/.gimme/versions/go1.9.linux.amd64/src/bytes/buffer.go:52
github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.(*httpTransport).SendTraces(0xc4201380c0, 0xc438f4fb00, 0x8, 0x8, 0x0, 0x0, 0x0)
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/transport.go:117 +0x80d
github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.(*Tracer).flushTraces(0xc4201341a0)
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/tracer.go:286 +0x33d
github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.(*Tracer).flush(0xc4201341a0)
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/tracer.go:329 +0x2b
github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.(*Tracer).worker(0xc4201341a0)
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/tracer.go:352 +0x321
created by github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer.NewTracerTransport
        /home/vagrant/go/src/github.com/DataDog/dd-go/vendor/github.com/DataDog/dd-trace-go/tracer/tracer.go:87 +0x3d6

Race condition (already solved by #112)

==================
WARNING: DATA RACE
Write at 0x00c42005a300 by goroutine 7:
  bytes.(*Buffer).Truncate()
      /home/vagrant/.gimme/versions/go1.8.linux.amd64/src/bytes/buffer.go:71 +0x40
  bytes.(*Buffer).Reset()
      /home/vagrant/.gimme/versions/go1.8.linux.amd64/src/bytes/buffer.go:85 +0x41
  github.com/DataDog/dd-trace-go/tracer.(*msgpackEncoder).EncodeTraces()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/encoder.go:42 +0x50
  github.com/DataDog/dd-trace-go/tracer.(*httpTransport).SendTraces()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/transport.go:103 +0x21d
  github.com/DataDog/dd-trace-go/tracer.(*Tracer).flushTraces()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:275 +0x457
  github.com/DataDog/dd-trace-go/tracer.(*Tracer).flush()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:318 +0x38
  github.com/DataDog/dd-trace-go/tracer.(*Tracer).worker()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:346 +0x217

Previous write at 0x00c42005a300 by goroutine 276:
  [failed to restore the stack]

Goroutine 7 (running) created at:
  github.com/DataDog/dd-trace-go/tracer.NewTracerTransport()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:85 +0x5b5
  github.com/DataDog/dd-trace-go/tracer.NewTracer()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:61 +0x46
  github.com/DataDog/dd-trace-go/tracer.init()
      /home/vagrant/go/src/github.com/DataDog/dd-trace-go/tracer/tracer.go:374 +0x108
  main.init()
      /home/vagrant/go/src/github.com/DataDog/dd-go/apps/hms-resolver/stats.go:114 +0xd1

Goroutine 276 (finished) created at:
  net/http.(*Transport).dialConn()
      /home/vagrant/.gimme/versions/go1.8.linux.amd64/src/net/http/transport.go:1118 +0xc2c
  net/http.(*Transport).getConn.func4()
      /home/vagrant/.gimme/versions/go1.8.linux.amd64/src/net/http/transport.go:908 +0xa2
==================

@gabsn gabsn requested a review from palazzem October 2, 2017 23:07
@gabsn gabsn force-pushed the gabin/fix-buffer branch 3 times, most recently from 9cb2eca to 3efa359 on October 4, 2017 at 21:30
@palazzem palazzem added this to the 0.5.2 milestone Nov 27, 2017
@palazzem palazzem added the core and enhancement (quick change/addition that does not need full team approval) labels Nov 27, 2017
@palazzem palazzem requested a review from ufoot November 27, 2017 14:58
@ufoot ufoot (Member) left a comment

Changes requested on coding style, but nothing big. This only needs to be battle-tested under heavy load.

JSON_ENCODER = iota
MSGPACK_ENCODER
jsonType = iota
msgpackType
Member:

I don't really see the point in having an int to describe the encoder type here. Just use the encoding as the key. Let the constants be jsonType = "application/json" and msgpackType = "application/msgpack", and use those directly. Using an int does not actually restrict us to a finite set of values, so we always run the risk of passing -1 or 2, going out of bounds, and returning nil when calling contentType(int). So let's just use the encoding as the key, while still using the constants jsonType and msgpackType to avoid copy/paste and typo errors.
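A small sketch of this suggestion follows; it is not the actual dd-trace-go code, just an illustration where the constants carry the content type directly, so there is no int-to-string lookup and no out-of-range value to guard against:

```go
// Hypothetical sketch of the suggestion: the encoder-type constants are the
// content-type strings themselves, so no contentType(int) lookup is needed
// and there is no out-of-bounds int to worry about.
package main

import "fmt"

const (
	jsonType    = "application/json"
	msgpackType = "application/msgpack"
)

func newEncoderFor(encoderType string) string {
	switch encoderType {
	case jsonType:
		return "json encoder"
	case msgpackType:
		return "msgpack encoder"
	default:
		// Callers are expected to pass one of the named constants above.
		return ""
	}
}

func main() {
	// The constant doubles as the Content-Type header value.
	fmt.Println(newEncoderFor(msgpackType))
}
```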

pool := &encoderPool{
encoderType: encoderType,
pool: make(chan Encoder, size),
func contentType(encoderType int) string {
Member:

I think this function can be removed; just use encoderType as the key.

// can theoretically read the underlying buffer even after the encoder has been returned to the pool.
// Since the underlying bytes.Buffer is not thread-safe, this can make the app panic,
// since this method will later spawn a goroutine referencing this buffer.
// That's why we prefer the less performant yet SAFE implementation of allocating a new encoder every time we flush.
Member:

One way to get the encoder overhead would be to just write a unit test that... only allocates an encoder. Totally agree that if this only happens on the order of every 2 seconds, we'd be fine. Our current implementation tends to flush big payloads rather than make many small calls, which, from a high-level view, is fine.
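A hedged sketch of such a measurement, written as a Go benchmark; benchEncoder is a stand-in for whatever encoder type the tracer actually constructs:

```go
// encoder_bench_test.go: illustrative only; benchEncoder stands in for the
// tracer's real encoder type.
package tracer

import (
	"bytes"
	"testing"
)

type benchEncoder struct {
	buf *bytes.Buffer
}

func newBenchEncoder() *benchEncoder {
	return &benchEncoder{buf: &bytes.Buffer{}}
}

// BenchmarkEncoderAllocation measures only the cost the per-flush strategy
// adds: allocating a fresh encoder, which in practice happens about every 2s.
func BenchmarkEncoderAllocation(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = newBenchEncoder()
	}
}
```

Running it with `go test -bench=EncoderAllocation -benchmem` would report ns/op and allocs/op for the allocation alone.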

Contributor:

I totally agree that we're good here. This isn't a hot path: even with a million requests, we still only allocate one encoder every 2 seconds. Good to keep this comment to explain why we undo that (premature) optimization.
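For readers outside the review thread, here is a contrived, self-contained illustration of the hazard those code comments describe (deliberately racy, and not the tracer's code): the slice returned by bytes.Buffer.Bytes() aliases the buffer's storage, so resetting and reusing a pooled buffer while a sender goroutine still holds that slice is a data race.

```go
// Deliberately racy illustration (run with `go run -race` to see the report):
// the slice from buf.Bytes() aliases the buffer, so a later Reset/Write on a
// reused, pooled buffer mutates bytes a slow sender may still be reading.
package main

import (
	"bytes"
	"fmt"
	"time"
)

func main() {
	shared := &bytes.Buffer{} // imagine this buffer came from an encoder pool

	shared.WriteString("payload-1")
	view := shared.Bytes() // the "sent" bytes still alias shared's storage

	go func() {
		time.Sleep(10 * time.Millisecond) // pretend this is a slow HTTP send
		fmt.Printf("sender sees: %q\n", view)
	}()

	shared.Reset()                  // encoder "returned to the pool" and reused...
	shared.WriteString("payload-2") // ...corrupting what the sender will read

	time.Sleep(50 * time.Millisecond)
}
```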

@palazzem palazzem modified the milestones: 0.5.2, 0.6.1 Dec 6, 2017
@bmermet bmermet force-pushed the gabin/fix-buffer branch 2 times, most recently from 9fcabc6 to 585deea on December 6, 2017 at 09:24
@bmermet (Contributor) commented Dec 7, 2017

@ufoot and @palazzem I have checked locally that this PR solves the race condition in serialization. I think it's ready to be merged.
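Presumably a check like this amounts to running the tests under Go's race detector, for example (package path illustrative):

```
go test -race ./tracer/...
```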

@ufoot ufoot (Member) left a comment

Thanks for the update, GTM.

@palazzem palazzem merged commit 21bd9f7 into master Dec 12, 2017
@palazzem palazzem deleted the gabin/fix-buffer branch December 12, 2017 10:32
jdgordon pushed a commit to jdgordon/dd-trace-go that referenced this pull request May 31, 2022