Add config to capture stacktrace when a span duration exceeds threshold#1160
Conversation
5b20902 to
ca4c272
Compare
dougqh
left a comment
There was a problem hiding this comment.
Two main concerns...
1 - I think using StackTraceElement[] and filtering on the object would easier to follow than the current Exception printing & FilteringStringWriter approach
2 - Going forward, I think anomalies that are worth logging or tagging should also have a corresponding health metric
| } | ||
|
|
||
| private void addStacktraceIfThresholdExceeded() { | ||
| final long spanDurationStacktraceNanos = Config.get().getSpanDurationStacktraceNanos(); |
There was a problem hiding this comment.
Long term, I'd like to get away from the use of Singletons like Config -- but for now, I think this is fine.
|
|
||
| private void addStacktraceIfThresholdExceeded() { | ||
| final long spanDurationStacktraceNanos = Config.get().getSpanDurationStacktraceNanos(); | ||
| if (!isError() |
There was a problem hiding this comment.
I'd like to see a health metric for anomalous situations as well.
| // ensure a min duration of 1 | ||
| if (this.durationNano.compareAndSet(0, Math.max(1, durationNano))) { | ||
| log.debug("Finished: {}", this); | ||
| addStacktraceIfThresholdExceeded(); |
There was a problem hiding this comment.
This feels like an optional behavior that might be better handled through a Listener mechanism.
However, I'm fine with deferring that to a later PR.
|
|
||
| @Override | ||
| public void write(final char[] cbuf, final int off, final int len) throws IOException { | ||
| if ((off < 0) |
There was a problem hiding this comment.
I think requesting and traversing the StackTraceElement[] would be easier to follow that filtering Writer.
There was a problem hiding this comment.
I agree. It would be also useful to make the filter configurable. There is often a lot of framework code that can be ignored. I'd even consider including only frames from my code. It worked great in mjprof
There was a problem hiding this comment.
For simplicity, I'm not going to make it configurable in this PR... we can evaluate that later.
| private static final String DEFAULT_SPLIT_BY_TAGS = ""; | ||
| private static final int DEFAULT_PARTIAL_FLUSH_MIN_SPANS = 1000; | ||
| private static final int DEFAULT_SPAN_DURATION_STACKTRACE_MILLIS = | ||
| (int) TimeUnit.SECONDS.toMillis(1); |
There was a problem hiding this comment.
Depending on a use-case of course, but for regular web applications 1 second is a strict setting, and many users can be flooded.
There was a problem hiding this comment.
I was trying to find a balance between a config default that is useful, but not burdensome... My thought is that a stacktrace is expensive, but if a span is more than a second, then the cost of a stacktrace is likely not noticeable compared to the rest of the trace. Do you have any suggestions for a different config default?
ca4c272 to
77b2232
Compare
|
@dougqh @jkubrynski, I've updated the PR with your suggestions. Good idea about the |
(But only when span is not errored or finished on a different thread.) Use the following config: ``` -Ddd.trace.span.duration.stacktrace.millis=1000 ``` (One second is the default, 0 disables.)
This should help avoid too frequent of stacktraces when the configured threshold is a normal duration. Alternatively, we could make it a percentage above the average.
3d0370b to
2f64eda
Compare
(But only when span is not errored or finished on a different thread.)
Use the following config:
(One second is the default,
0disables.)