
[tracer] using channels to send complete consistent traces #78

Merged · 38 commits · Jun 19, 2017

Conversation

@ufoot (Member) commented Jun 14, 2017

This is a variant of #77: it achieves the same thing (avoiding sending partial, incomplete traces), but through a deeper rewrite. The major change is that it uses a channel to send the finished spans of complete traces. Quick overview:

  • tracer.NewRootSpan -> now creates a trace-bound in-memory buffer (an array) which holds pointers to all spans belonging to this trace
  • tracer.NewChildSpan -> registers the child span in the in-memory buffer above
  • span.Finish() -> now calls buffer.Flush()
  • buffer.AckFinish() -> called after every span.Finish(); checks whether all spans in the buffer are finished, and if so, puts all spans into a global, shared channel
  • tracer.worker() -> no longer fills local buffers with traces or spans. Instead, it only handles special events such as forced flushes, plus the regular, time-based flushing. The low-level flush functions now locally empty the channels and send the data right away, removing the need for a persistent, local, mutex-protected array.
  • tracer.flush() -> called on a regular basis; empties the tracer-bound buffer and sends data to the agent (we still need this to batch agent calls and mitigate protocol overhead)

The trace channel is "non-blocking" (writes use a select/default combo to ensure it never affects user code), so at some point, under very heavy load, we could technically lose/ignore traces. It would show up in the logs anyway, and it takes a really heavy load (globally, more than 1000 traces per lib<->agent roundtrip, so in practice the lib or agent being saturated).

IMPORTANT NOTICE, this patch deprecates:

  • SetSpansBufferSize -> no longer really makes sense; there are several buffers & channels now (for spans, traces, services, errors, ...) and this one is no longer a game changer.
  • FlushTraces has been renamed to flushTraces and is now private. Instead, a ForceFlush method is provided, which is safe to use. It's not required in production, but it can be called, it won't harm, and it's thread-safe. It flushes everything (not only traces).

@ufoot ufoot added the bug unintended behavior that has to be fixed label Jun 14, 2017
tracer/buffer.go Outdated
// and avoid re-allocing.
spans := sb.spans
sb.spans = nil
for _, span := range tb.spans {
Contributor commented:

Instead of looping each time you try to do a doFlush, can't you store an int that counts the number of finished spans? Even if looping here is not so expensive in the average case, we still have a loop and an RLock(). So, to check whether a buffer is flushable, you may:

  • when you add a Span in doPush(), increase the traceBuffer.finishedSpan counter (when a Span is finished, its state is immutable, so it's safe to always increase that number)
  • when you want to check if a buffer is finished, check whether traceBuffer.finishedSpan == len(traceBuffer.spans)

Makes sense?

Member Author replied:

Thanks for the idea; indeed, checking the number of finished spans against the buffer length makes sense.

tracer/buffer.go Outdated
}

return &spansBuffer{maxSize: maxSize}
tb.spans = append(tb.spans, span)
@bmermet (Contributor) commented Jun 14, 2017:

This array behaves a lot like a buffered channel. Switching to one would remove the need for an explicit lock and with Manu's suggestion of adding a counter it would still be possible to count the number of finished spans.

Member Author replied:

I considered this, but at some point we really use a []*Span later in the pipeline. I.e., it's not a pipeline but rather something that is filled until it's "complete" and then flushed/sent at once. Typically, filling the trace buffer from a channel would require emptying that channel item by item and pumping it into the final array. Will give it a second look anyway, thanks for suggesting.

@ufoot (Member Author) commented Jun 16, 2017:

Released this on staging for:

  • trace-stats-aggregator
  • trace-stats-query
  • trace-api

@vlad-mh (Contributor) commented Jun 19, 2017:

We've been running this version in prod for a couple of days now, and it's looking fine.

@ufoot ufoot merged commit b841c48 into master Jun 19, 2017
ufoot added a commit that referenced this pull request Jun 19, 2017
This contains two PRs:

- #73 branch raphael/gocql (Cassandra support)
- #78 branch christian/flushchannel (fix partial traces bug)

Those two branches had conflicts, and commit 6952e85 fixes
the interactions between them. Additionally, it has been running
for 3 complete days on our prod env, so it is considered stable enough.
@ufoot ufoot added this to the 0.5.0 milestone Jun 19, 2017
@palazzem palazzem deleted the christian/flushchannel branch November 27, 2017 15:13
jdgordon pushed a commit to jdgordon/dd-trace-go that referenced this pull request May 31, 2022
…ataDog#78)

* Support the DD_ENTITY_ID envvar for container tagging

* Support the DD_AGENT_HOST and DD_DOGSTATSD_PORT envvars for autoconfiguration

* pr feedback

* doc

* don't use testing.Run, introduced in go 1.7

* golint

* change entity tag name

* Reword godoc