-
Notifications
You must be signed in to change notification settings - Fork 279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add instrumentation of exception stacktraces #6609
Conversation
When instrumenting exception on first occurence, we get the stacktrace of the exception and for each frame we will generate a special method probe (ExceptionProbe) based on LogProbe doing a snapshot at Exit. to place correctly the probes we are basing on StackTrace element that gives class, method and line number. for overloaded methods (with same name) the only way to disambiguate which method to instrument is to use the line number. We introduce then the ClassFileLines class to generalize the LineNumber information extracted from the parsing of a classfile by the ASM library and be reused for line instrumentation and method matching. previous LineMap class is then replaced by it. Created exception probe are kept by ExceptionProbeManager and must be merged with the Configuration coming from Remote Config. Therefore the ConfigurationUpdater class must now take into account that and also be able to be triggered not only by the Remote Config thread but also the thread coming from the exception captured added to the span. Applying configuration require now to be atomic between Configuration object, new transformer and retransformation of the classes impacted.
BenchmarksStartupParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 46 metrics, 8 unstable metrics. Startup time reports for petclinicgantt
title petclinic - global startup overhead: candidate=1.29.0-SNAPSHOT~adeb199dd6, baseline=1.29.0-SNAPSHOT~c34fdeaac9
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.057 s) : 0, 1056992
Total [baseline] (9.318 s) : 0, 9318123
Agent [candidate] (1.072 s) : 0, 1071640
Total [candidate] (9.359 s) : 0, 9359328
section appsec
Agent [baseline] (1.159 s) : 0, 1158717
Total [baseline] (9.47 s) : 0, 9469639
Agent [candidate] (1.155 s) : 0, 1155182
Total [candidate] (9.449 s) : 0, 9449081
section iast
Agent [baseline] (1.18 s) : 0, 1180364
Total [baseline] (9.678 s) : 0, 9678489
Agent [candidate] (1.184 s) : 0, 1183607
Total [candidate] (9.647 s) : 0, 9647140
section profiling
Agent [baseline] (1.277 s) : 0, 1277080
Total [baseline] (9.531 s) : 0, 9531009
Agent [candidate] (1.277 s) : 0, 1277316
Total [candidate] (9.565 s) : 0, 9565381
gantt
title petclinic - break down per module: candidate=1.29.0-SNAPSHOT~adeb199dd6, baseline=1.29.0-SNAPSHOT~c34fdeaac9
dateFormat X
axisFormat %s
section tracing
BytebuddyAgent [baseline] (666.891 ms) : 0, 666891
BytebuddyAgent [candidate] (675.905 ms) : 0, 675905
GlobalTracer [baseline] (295.836 ms) : 0, 295836
GlobalTracer [candidate] (300.152 ms) : 0, 300152
AppSec [baseline] (51.732 ms) : 0, 51732
AppSec [candidate] (52.638 ms) : 0, 52638
Remote Config [baseline] (686.264 µs) : 0, 686
Remote Config [candidate] (702.684 µs) : 0, 703
Telemetry [baseline] (7.519 ms) : 0, 7519
Telemetry [candidate] (7.544 ms) : 0, 7544
section appsec
BytebuddyAgent [baseline] (670.621 ms) : 0, 670621
BytebuddyAgent [candidate] (666.886 ms) : 0, 666886
GlobalTracer [baseline] (296.435 ms) : 0, 296435
GlobalTracer [candidate] (296.487 ms) : 0, 296487
AppSec [baseline] (149.83 ms) : 0, 149830
AppSec [candidate] (150.237 ms) : 0, 150237
Remote Config [baseline] (651.866 µs) : 0, 652
Remote Config [candidate] (651.329 µs) : 0, 651
Telemetry [baseline] (6.747 ms) : 0, 6747
Telemetry [candidate] (6.709 ms) : 0, 6709
section iast
BytebuddyAgent [baseline] (777.482 ms) : 0, 777482
BytebuddyAgent [candidate] (777.953 ms) : 0, 777953
GlobalTracer [baseline] (287.67 ms) : 0, 287670
GlobalTracer [candidate] (288.499 ms) : 0, 288499
AppSec [baseline] (53.862 ms) : 0, 53862
AppSec [candidate] (53.135 ms) : 0, 53135
IAST [baseline] (20.053 ms) : 0, 20053
IAST [candidate] (22.504 ms) : 0, 22504
Remote Config [baseline] (603.272 µs) : 0, 603
Remote Config [candidate] (619.723 µs) : 0, 620
Telemetry [baseline] (6.507 ms) : 0, 6507
Telemetry [candidate] (6.627 ms) : 0, 6627
section profiling
BytebuddyAgent [baseline] (665.485 ms) : 0, 665485
BytebuddyAgent [candidate] (667.285 ms) : 0, 667285
GlobalTracer [baseline] (381.532 ms) : 0, 381532
GlobalTracer [candidate] (381.27 ms) : 0, 381270
AppSec [baseline] (52.132 ms) : 0, 52132
AppSec [candidate] (52.106 ms) : 0, 52106
Remote Config [baseline] (662.376 µs) : 0, 662
Remote Config [candidate] (658.178 µs) : 0, 658
Telemetry [baseline] (8.174 ms) : 0, 8174
Telemetry [candidate] (7.399 ms) : 0, 7399
ProfilingAgent [baseline] (114.538 ms) : 0, 114538
ProfilingAgent [candidate] (114.034 ms) : 0, 114034
Profiling [baseline] (114.563 ms) : 0, 114563
Profiling [candidate] (114.058 ms) : 0, 114058
LoadParameters
See matching parameters
SummaryFound 0 performance improvements and 1 performance regressions! Performance is the same for 9 metrics, 16 unstable metrics.
Request duration reports for petclinicgantt
title petclinic - request duration [CI 0.99] : candidate=1.29.0-SNAPSHOT~adeb199dd6, baseline=1.29.0-SNAPSHOT~c34fdeaac9
dateFormat X
axisFormat %s
section baseline
no_agent (1.355 ms) : 1336, 1374
. : milestone, 1355,
appsec (1.767 ms) : 1741, 1793
. : milestone, 1767,
iast (1.521 ms) : 1497, 1545
. : milestone, 1521,
profiling (1.5 ms) : 1475, 1525
. : milestone, 1500,
tracing (1.489 ms) : 1465, 1514
. : milestone, 1489,
section candidate
no_agent (1.347 ms) : 1328, 1366
. : milestone, 1347,
appsec (1.751 ms) : 1726, 1776
. : milestone, 1751,
iast (1.519 ms) : 1495, 1544
. : milestone, 1519,
profiling (1.583 ms) : 1557, 1609
. : milestone, 1583,
tracing (1.516 ms) : 1490, 1542
. : milestone, 1516,
Request duration reports for insecure-bankgantt
title insecure-bank - request duration [CI 0.99] : candidate=1.29.0-SNAPSHOT~adeb199dd6, baseline=1.29.0-SNAPSHOT~c34fdeaac9
dateFormat X
axisFormat %s
section baseline
no_agent (371.467 µs) : 350, 393
. : milestone, 371,
iast (470.831 µs) : 450, 492
. : milestone, 471,
iast_FULL (537.052 µs) : 516, 558
. : milestone, 537,
iast_GLOBAL (501.516 µs) : 481, 522
. : milestone, 502,
iast_HARDCODED_SECRET_DISABLED (466.863 µs) : 447, 487
. : milestone, 467,
iast_INACTIVE (451.116 µs) : 429, 473
. : milestone, 451,
iast_TELEMETRY_OFF (470.591 µs) : 450, 491
. : milestone, 471,
tracing (442.822 µs) : 422, 464
. : milestone, 443,
section candidate
no_agent (367.605 µs) : 347, 388
. : milestone, 368,
iast (475.848 µs) : 455, 497
. : milestone, 476,
iast_FULL (528.997 µs) : 508, 550
. : milestone, 529,
iast_GLOBAL (510.289 µs) : 488, 533
. : milestone, 510,
iast_HARDCODED_SECRET_DISABLED (469.045 µs) : 449, 489
. : milestone, 469,
iast_INACTIVE (438.336 µs) : 418, 459
. : milestone, 438,
iast_TELEMETRY_OFF (475.011 µs) : 454, 496
. : milestone, 475,
tracing (434.421 µs) : 414, 455
. : milestone, 434,
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaving on comment until we fix the hot-path issue.
this.instrumentation = instrumentation; | ||
this.transformerSupplier = transformerSupplier; | ||
this.serviceName = TagsHelper.sanitize(config.getServiceName()); | ||
this.sink = sink; | ||
this.finder = finder; | ||
this.exceptionProbeManager = exceptionProbeManager; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like that we make configuration updater know about source of the configuration.
I would say lets try to keep this extendable more easily.
One way to think about it is having
accept(source: string, configuration) {
configPerSource[source] = configruation;
newConfiguration = buildConfigurationFromAllSourcesAndApplyFilters();
applyNewConfiguration(newConfiguration);
}
Then we simply can add more sources like span-origin, local-config etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the idea, but let's have another source before make an abstraction out of it.
...a-agent/agent-debugger/src/main/java/com/datadog/debugger/instrumentation/ClassFileInfo.java
Outdated
Show resolved
Hide resolved
...nt/agent-debugger/src/main/java/com/datadog/debugger/exception/DefaultExceptionDebugger.java
Outdated
Show resolved
Hide resolved
@@ -27,6 +31,7 @@ public void handleException(Throwable t) { | |||
// TODO trigger send snapshots already captured | |||
} else { | |||
exceptionProbeManager.instrument(fingerprint, t.getStackTrace()); | |||
configurationUpdater.reapplyCurrentConfig(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like we talked. I think that because "reapplyCurrentConfig" is blocking it should not be called inside the hot path of span.attachError.
Maybe this can stay here is configurationUpdater would reapply the configuration async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep will do later
if (changes.hasProbeRelatedChanges()) { | ||
LOGGER.info("Applying new probe configuration, changes: {}", changes); | ||
handleProbesChanges(changes); | ||
configurationLock.lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one things we can get benefits from making is async is "batching".
assume we got multiple exception triggered and we calling applyConfiguration 3 time each time with few more probes.
This code can check if we are currently applying configuration, and if we do, just mark that the latestConfig needed to be applied. this could happen multiple times if retransformation taking along time, and every time it happen we basically skipping applying a old configuration that will no longer relevant or if lock is not ordered we could be in a case where we apply a configuration that is not the latest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes with async and a queue we could conflate to discard intermediate configs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
What Does This Do
When instrumenting exception on first occurence, we get the stacktrace of the exception and for each frame we will generate a special method probe (ExceptionProbe) based on LogProbe doing a snapshot at Exit. to place correctly the probes we are basing on StackTrace element that gives class, method and line number.
for overloaded methods (with same name) the only way to disambiguate which method to instrument is to use the line number. We introduce then the ClassFileLines class to generalize the LineNumber information extracted from the parsing of a classfile by the ASM library and be reused for line instrumentation and method matching. previous LineMap class is then replaced by it.
Created exception probe are kept by ExceptionProbeManager and must be merged with the Configuration coming from Remote Config. Therefore the ConfigurationUpdater class must now take into account that and also be able to be triggered not only by the Remote Config thread but also the thread coming from the exception captured added to the span. Applying configuration require now to be atomic between Configuration object, new transformer and retransformation of the classes impacted.
Motivation
Exception debugging feature
Additional Notes
Jira ticket: DEBUG-2068