
[tracer] using channels to send complete consistent traces #78

Merged · 38 commits · Jun 19, 2017

Conversation

@ufoot (Member) commented Jun 14, 2017

This is a variant of #77: it achieves the same thing (avoiding sending partial, incomplete traces), but through a deeper rewrite. The major change is that it uses a channel to send the finished spans of complete traces. Quick overview:

  • tracer.NewRootSpan -> now creates a trace-bound in-memory buffer (an array) which holds pointers to all spans belonging to this trace
  • tracer.NewChildSpan -> registers the child span in the in-memory buffer above
  • span.Finish() -> now calls buffer.Flush()
  • buffer.AckFinish() -> called after every span.Finish(); checks whether all spans in the buffer are finished, and if so, puts all spans into a global, shared channel
  • tracer.worker() -> no longer fills local buffers with traces or spans. Instead, it only handles special events such as forced flushes, plus the regular, time-based flushing. The low-level flush functions now locally empty the channels and send the data right away, removing the need for a persistent, local, mutex-protected array.
  • tracer.flush() -> called on a regular basis; empties the tracer-bound buffer and sends data to the agent (we still need this to batch agent calls and mitigate protocol overhead)

The trace channel is "non-blocking" (writes use a select/default combo to ensure it never affects user code), so at some point, under very heavy load, we could technically lose/ignore traces. It would show up in the logs anyway, and it takes a really heavy load (globally, more than 1000 traces per lib<->agent roundtrip, so in practice the lib or agent being saturated).

IMPORTANT NOTICE, this patch deprecates:

  • SetSpansBufferSize -> no longer really makes sense; there are several buffers & channels now (for spans, traces, services, errors, ...) and this one is no longer a game changer.
  • FlushTraces has been renamed to flushTraces and is now private. Instead, a ForceFlush method is provided, which is safe to use. It's not required in production, but it can be called, it won't harm, and it's thread-safe. It flushes everything (not only traces).

@ufoot ufoot added the bug unintended behavior that has to be fixed label Jun 14, 2017
tracer/buffer.go Outdated
// and avoid re-allocing.
spans := sb.spans
sb.spans = nil
for _, span := range tb.spans {
Contributor commented:

Instead of looping each time you try to do a doFlush, can't you store an int that counts the number of finished spans? Even if looping here is not so expensive in the average case, we still have a loop and an RLock(). So, to check whether a buffer is flushable, you may:

  • when you add a Span in doPush(), increase the traceBuffer.finishedSpan counter (when a Span is finished, its state is immutable, so it's safe to always increase that number)
  • when you want to check if a buffer is finished, check whether traceBuffer.finishedSpan == len(traceBuffer.spans)

Makes sense?

Member Author replied:

Thanks for the idea; indeed, checking the number of finished spans against the buffer length makes sense.

tracer/buffer.go Outdated
}

return &spansBuffer{maxSize: maxSize}
tb.spans = append(tb.spans, span)
@bmermet (Contributor) commented Jun 14, 2017:

This array behaves a lot like a buffered channel. Switching to one would remove the need for an explicit lock and with Manu's suggestion of adding a counter it would still be possible to count the number of finished spans.

Member Author replied:

I considered this, but at some point we really use a []*Span later in the pipeline. I.e., it's not a pipeline but rather something that is filled until it's "complete" and then flushed/sent at once. Typically, filling the trace buffer from a channel would require emptying that channel item by item and pumping it into the final array. Will give it a second look anyway, thanks for suggesting.

@ufoot (Member Author) commented Jun 16, 2017:

Released this on staging for:

  • trace-stats-aggregator
  • trace-stats-query
  • trace-api

@vlad-mh (Contributor) commented Jun 19, 2017:

We've been running this version in prod for a couple of days now, and it's looking fine.

@ufoot ufoot merged commit b841c48 into master Jun 19, 2017
ufoot added a commit that referenced this pull request Jun 19, 2017
This contains two PRs:

- #73 branch raphael/gocql (Cassandra support)
- #78 branch christian/flushchannel (fix partial traces bug)

Those two branches had conflicts, and commit 6952e85 fixes
the interactions between them. Additionally, it has been running
for 3 complete days on our prod env, so it is considered stable enough.
@ufoot ufoot added this to the 0.5.0 milestone Jun 19, 2017
@palazzem palazzem deleted the christian/flushchannel branch November 27, 2017 15:13
jdgordon pushed a commit to jdgordon/dd-trace-go that referenced this pull request May 31, 2022
…ataDog#78)

* Support the DD_ENTITY_ID envvar for container tagging

* Support the DD_AGENT_HOST and DD_DOGSTATSD_PORT envvars for autoconfiguration

* pr feedback

* doc

* don't use testing.Run, introduced in go 1.7

* golint

* change entity tag name

* Reword godoc