Skip to content

Add config to capture stacktrace when a span duration exceeds threshold#1160

Merged
tylerbenson merged 3 commits into
masterfrom
tyler/stacktrace
Mar 12, 2020
Merged

Add config to capture stacktrace when a span duration exceeds threshold#1160
tylerbenson merged 3 commits into
masterfrom
tyler/stacktrace

Conversation

@tylerbenson
Copy link
Copy Markdown
Contributor

@tylerbenson tylerbenson commented Jan 3, 2020

(But only when span is not errored or finished on a different thread.)

Use the following config:

-Ddd.trace.span.duration-above-average.stacktrace.millis=1000

(One second is the default, 0 disables.)

@tylerbenson tylerbenson requested a review from a team as a code owner January 3, 2020 01:40
Copy link
Copy Markdown
Contributor

@dougqh dougqh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two main concerns...
1 - I think using StackTraceElement[] and filtering on the object would easier to follow than the current Exception printing & FilteringStringWriter approach
2 - Going forward, I think anomalies that are worth logging or tagging should also have a corresponding health metric

}

private void addStacktraceIfThresholdExceeded() {
final long spanDurationStacktraceNanos = Config.get().getSpanDurationStacktraceNanos();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term, I'd like to get away from the use of Singletons like Config -- but for now, I think this is fine.

Comment thread dd-trace-ot/src/main/java/datadog/opentracing/DDSpan.java Outdated

private void addStacktraceIfThresholdExceeded() {
final long spanDurationStacktraceNanos = Config.get().getSpanDurationStacktraceNanos();
if (!isError()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see a health metric for anomalous situations as well.

// ensure a min duration of 1
if (this.durationNano.compareAndSet(0, Math.max(1, durationNano))) {
log.debug("Finished: {}", this);
addStacktraceIfThresholdExceeded();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like an optional behavior that might be better handled through a Listener mechanism.
However, I'm fine with deferring that to a later PR.


@Override
public void write(final char[] cbuf, final int off, final int len) throws IOException {
if ((off < 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think requesting and traversing the StackTraceElement[] would be easier to follow that filtering Writer.

Copy link
Copy Markdown

@jkubrynski jkubrynski Jan 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It would be also useful to make the filter configurable. There is often a lot of framework code that can be ignored. I'd even consider including only frames from my code. It worked great in mjprof

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simplicity, I'm not going to make it configurable in this PR... we can evaluate that later.

private static final String DEFAULT_SPLIT_BY_TAGS = "";
private static final int DEFAULT_PARTIAL_FLUSH_MIN_SPANS = 1000;
private static final int DEFAULT_SPAN_DURATION_STACKTRACE_MILLIS =
(int) TimeUnit.SECONDS.toMillis(1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on a use-case of course, but for regular web applications 1 second is a strict setting, and many users can be flooded.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to find a balance between a config default that is useful, but not burdensome... My thought is that a stacktrace is expensive, but if a span is more than a second, then the cost of a stacktrace is likely not noticeable compared to the rest of the trace. Do you have any suggestions for a different config default?

@tylerbenson
Copy link
Copy Markdown
Contributor Author

@dougqh @jkubrynski, I've updated the PR with your suggestions. Good idea about the StackTraceElement. That was much easier and should have been my first solution. I've added a todo for the future to consider using the StackWalking api.

(But only when span is not errored or finished on a different thread.)

Use the following config:
```
-Ddd.trace.span.duration.stacktrace.millis=1000
```
(One second is the default, 0 disables.)
This should help avoid too frequent of stacktraces when the configured threshold is a normal duration.
Alternatively, we could make it a percentage above the average.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants