Limit DependencyResolverQueue size with a config flag #6729

nayeem-kamal · 2024-02-23T22:45:35Z

What Does This Do

Allows the user to set a limit to the size of the telemetry DependencyResolverQueue to prevent OOM errors when there are a large number of dependencies
Introduces a new config variable (dd.telemetry.dependency-resolution.queue.size) which takes an integer to set as the limit for the number of dependencies to store in the queue.

Motivation

Resolve APMS-11445

Additional Notes

Jira ticket: APMS-11445

pr-commenter · 2024-02-23T23:35:03Z

Benchmarks

Startup

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	nayeem-kamal/limit-telemetry-dependency-queue-size
git_commit_date	1708941753	1708977519
git_commit_sha	`ab5e982`	`856a99f`
release_version	1.31.0-SNAPSHOT~ab5e98242b	1.31.0-SNAPSHOT~856a99f46a

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1708980580	1708980580
ci_job_id	443705723	443705723
ci_pipeline_id	28959121	28959121
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module	Agent	Agent
parent	None	None
variant	iast	iast

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 41 metrics, 11 unstable metrics.

scenario	Δ mean execution_time	candidate mean execution_time	baseline mean execution_time
scenario:startup:petclinic:profiling:ProfilingAgent	worse [+15.346ms; +22.780ms] or [+16.574%; +24.604%]	111.650ms	92.587ms
scenario:startup:petclinic:profiling:Profiling	worse [+15.344ms; +22.778ms] or [+16.568%; +24.595%]	111.675ms	92.613ms

Startup time reports for petclinic

gantt
    title petclinic - global startup overhead: candidate=1.31.0-SNAPSHOT~856a99f46a, baseline=1.31.0-SNAPSHOT~ab5e98242b

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.077 s) : 0, 1077274
Total [baseline] (9.17 s) : 0, 9169625
Agent [candidate] (1.088 s) : 0, 1088103
Total [candidate] (9.197 s) : 0, 9196849
section appsec
Agent [baseline] (1.175 s) : 0, 1174688
Total [baseline] (9.385 s) : 0, 9384913
Agent [candidate] (1.185 s) : 0, 1185156
Total [candidate] (9.345 s) : 0, 9344991
section iast
Agent [baseline] (1.199 s) : 0, 1199246
Total [baseline] (9.389 s) : 0, 9388985
Agent [candidate] (1.202 s) : 0, 1201852
Total [candidate] (9.334 s) : 0, 9333752
section profiling
Agent [baseline] (1.271 s) : 0, 1271451
Total [baseline] (9.333 s) : 0, 9332762
Agent [candidate] (1.299 s) : 0, 1298551
Total [candidate] (9.355 s) : 0, 9355496

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.077 s	-
Agent	appsec	1.175 s	97.414 ms (9.0%)
Agent	iast	1.199 s	121.973 ms (11.3%)
Agent	profiling	1.271 s	194.177 ms (18.0%)
Total	tracing	9.17 s	-
Total	appsec	9.385 s	215.288 ms (2.3%)
Total	iast	9.389 s	219.36 ms (2.4%)
Total	profiling	9.333 s	163.137 ms (1.8%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.088 s	-
Agent	appsec	1.185 s	97.053 ms (8.9%)
Agent	iast	1.202 s	113.748 ms (10.5%)
Agent	profiling	1.299 s	210.448 ms (19.3%)
Total	tracing	9.197 s	-
Total	appsec	9.345 s	148.142 ms (1.6%)
Total	iast	9.334 s	136.903 ms (1.5%)
Total	profiling	9.355 s	158.647 ms (1.7%)

gantt
    title petclinic - break down per module: candidate=1.31.0-SNAPSHOT~856a99f46a, baseline=1.31.0-SNAPSHOT~ab5e98242b

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (685.796 ms) : 0, 685796
BytebuddyAgent [candidate] (692.747 ms) : 0, 692747
GlobalTracer [baseline] (297.426 ms) : 0, 297426
GlobalTracer [candidate] (300.432 ms) : 0, 300432
AppSec [baseline] (51.349 ms) : 0, 51349
AppSec [candidate] (51.633 ms) : 0, 51633
Remote Config [baseline] (707.729 µs) : 0, 708
Remote Config [candidate] (724.531 µs) : 0, 725
Telemetry [baseline] (7.692 ms) : 0, 7692
Telemetry [candidate] (7.928 ms) : 0, 7928
section appsec
BytebuddyAgent [baseline] (684.8 ms) : 0, 684800
BytebuddyAgent [candidate] (691.054 ms) : 0, 691054
GlobalTracer [baseline] (297.007 ms) : 0, 297007
GlobalTracer [candidate] (300.314 ms) : 0, 300314
AppSec [baseline] (151.196 ms) : 0, 151196
AppSec [candidate] (151.619 ms) : 0, 151619
Remote Config [baseline] (631.946 µs) : 0, 632
Remote Config [candidate] (640.977 µs) : 0, 641
Telemetry [baseline] (6.883 ms) : 0, 6883
Telemetry [candidate] (7.006 ms) : 0, 7006
section iast
BytebuddyAgent [baseline] (794.416 ms) : 0, 794416
BytebuddyAgent [candidate] (796.105 ms) : 0, 796105
GlobalTracer [baseline] (287.899 ms) : 0, 287899
GlobalTracer [candidate] (288.12 ms) : 0, 288120
AppSec [baseline] (54.048 ms) : 0, 54048
AppSec [candidate] (56.615 ms) : 0, 56615
Remote Config [baseline] (615.944 µs) : 0, 616
Remote Config [candidate] (612.252 µs) : 0, 612
Telemetry [baseline] (6.586 ms) : 0, 6586
Telemetry [candidate] (6.575 ms) : 0, 6575
IAST [baseline] (21.396 ms) : 0, 21396
IAST [candidate] (19.603 ms) : 0, 19603
section profiling
ProfilingAgent [baseline] (92.587 ms) : 0, 92587
ProfilingAgent [candidate] (111.65 ms) : 0, 111650
BytebuddyAgent [baseline] (678.333 ms) : 0, 678333
BytebuddyAgent [candidate] (683.192 ms) : 0, 683192
GlobalTracer [baseline] (379.858 ms) : 0, 379858
GlobalTracer [candidate] (382.641 ms) : 0, 382641
AppSec [baseline] (53.026 ms) : 0, 53026
AppSec [candidate] (52.992 ms) : 0, 52992
Remote Config [baseline] (829.639 µs) : 0, 830
Remote Config [candidate] (800.094 µs) : 0, 800
Telemetry [baseline] (10.788 ms) : 0, 10788
Telemetry [candidate] (10.904 ms) : 0, 10904
Profiling [baseline] (92.613 ms) : 0, 92613
Profiling [candidate] (111.675 ms) : 0, 111675

Load

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
end_time	2024-02-26T20:26:07	2024-02-26T20:44:41
git_branch	master	nayeem-kamal/limit-telemetry-dependency-queue-size
git_commit_date	1708941753	1708977519
git_commit_sha	`ab5e982`	`856a99f`
release_version	1.31.0-SNAPSHOT~ab5e98242b	1.31.0-SNAPSHOT~856a99f46a
start_time	2024-02-26T20:25:54	2024-02-26T20:44:28

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1708980580	1708980580
ci_job_id	443705723	443705723
ci_pipeline_id	28959121	28959121
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant	iast	iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 16 unstable metrics.

Request duration reports for petclinic

gantt
    title petclinic - request duration [CI 0.99] : candidate=1.31.0-SNAPSHOT~856a99f46a, baseline=1.31.0-SNAPSHOT~ab5e98242b
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.347 ms) : 1328, 1366
.   : milestone, 1347,
appsec (1.789 ms) : 1764, 1813
.   : milestone, 1789,
iast (1.526 ms) : 1503, 1549
.   : milestone, 1526,
profiling (1.494 ms) : 1471, 1518
.   : milestone, 1494,
tracing (1.508 ms) : 1485, 1531
.   : milestone, 1508,
section candidate
no_agent (1.348 ms) : 1329, 1367
.   : milestone, 1348,
appsec (1.773 ms) : 1750, 1796
.   : milestone, 1773,
iast (1.526 ms) : 1502, 1549
.   : milestone, 1526,
profiling (1.523 ms) : 1499, 1547
.   : milestone, 1523,
tracing (1.483 ms) : 1460, 1506
.   : milestone, 1483,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.347 ms [1.328 ms, 1.366 ms]	-
appsec	1.789 ms [1.764 ms, 1.813 ms]	441.668 µs (32.8%)
iast	1.526 ms [1.503 ms, 1.549 ms]	179.082 µs (13.3%)
profiling	1.494 ms [1.471 ms, 1.518 ms]	147.199 µs (10.9%)
tracing	1.508 ms [1.485 ms, 1.531 ms]	161.068 µs (12.0%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.348 ms [1.329 ms, 1.367 ms]	-
appsec	1.773 ms [1.75 ms, 1.796 ms]	425.135 µs (31.5%)
iast	1.526 ms [1.502 ms, 1.549 ms]	177.891 µs (13.2%)
profiling	1.523 ms [1.499 ms, 1.547 ms]	175.179 µs (13.0%)
tracing	1.483 ms [1.46 ms, 1.506 ms]	135.114 µs (10.0%)

Request duration reports for insecure-bank

gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.31.0-SNAPSHOT~856a99f46a, baseline=1.31.0-SNAPSHOT~ab5e98242b
    dateFormat X
    axisFormat %s
section baseline
no_agent (365.191 µs) : 345, 385
.   : milestone, 365,
iast (473.753 µs) : 453, 495
.   : milestone, 474,
iast_FULL (534.851 µs) : 515, 555
.   : milestone, 535,
iast_GLOBAL (499.551 µs) : 478, 521
.   : milestone, 500,
iast_HARDCODED_SECRET_DISABLED (474.118 µs) : 454, 495
.   : milestone, 474,
iast_INACTIVE (436.618 µs) : 416, 457
.   : milestone, 437,
iast_TELEMETRY_OFF (471.972 µs) : 451, 493
.   : milestone, 472,
tracing (440.353 µs) : 419, 461
.   : milestone, 440,
section candidate
no_agent (356.5 µs) : 337, 376
.   : milestone, 356,
iast (473.493 µs) : 453, 494
.   : milestone, 473,
iast_FULL (532.675 µs) : 512, 553
.   : milestone, 533,
iast_GLOBAL (490.575 µs) : 470, 511
.   : milestone, 491,
iast_HARDCODED_SECRET_DISABLED (468.465 µs) : 448, 489
.   : milestone, 468,
iast_INACTIVE (446.822 µs) : 426, 468
.   : milestone, 447,
iast_TELEMETRY_OFF (464.644 µs) : 444, 485
.   : milestone, 465,
tracing (443.504 µs) : 423, 464
.   : milestone, 444,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	365.191 µs [345.227 µs, 385.156 µs]	-
iast	473.753 µs [452.767 µs, 494.738 µs]	108.561 µs (29.7%)
iast_FULL	534.851 µs [514.501 µs, 555.2 µs]	169.659 µs (46.5%)
iast_GLOBAL	499.551 µs [478.093 µs, 521.01 µs]	134.36 µs (36.8%)
iast_HARDCODED_SECRET_DISABLED	474.118 µs [453.735 µs, 494.501 µs]	108.927 µs (29.8%)
iast_INACTIVE	436.618 µs [416.468 µs, 456.769 µs]	71.427 µs (19.6%)
iast_TELEMETRY_OFF	471.972 µs [450.882 µs, 493.062 µs]	106.781 µs (29.2%)
tracing	440.353 µs [419.463 µs, 461.244 µs]	75.162 µs (20.6%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	356.5 µs [336.804 µs, 376.195 µs]	-
iast	473.493 µs [452.74 µs, 494.246 µs]	116.993 µs (32.8%)
iast_FULL	532.675 µs [511.937 µs, 553.414 µs]	176.175 µs (49.4%)
iast_GLOBAL	490.575 µs [470.429 µs, 510.72 µs]	134.075 µs (37.6%)
iast_HARDCODED_SECRET_DISABLED	468.465 µs [447.588 µs, 489.341 µs]	111.965 µs (31.4%)
iast_INACTIVE	446.822 µs [426.092 µs, 467.552 µs]	90.322 µs (25.3%)
iast_TELEMETRY_OFF	464.644 µs [443.983 µs, 485.305 µs]	108.144 µs (30.3%)
tracing	443.504 µs [422.777 µs, 464.23 µs]	87.004 µs (24.4%)

smola

We could probably set some large but not MAX_INTEGER by default.

smola · 2024-02-26T11:47:07Z

telemetry/src/main/java/datadog/telemetry/dependency/DependencyResolverQueue.java

  public void queueURI(URI uri) {
    if (uri == null) {
      return;
    }
-
+    if (processedUrlsSet.size() >= MAX_QUEUE_SIZE || newUrlsQueue.size() >= MAX_QUEUE_SIZE) {


Once we're past this threshold, we'll stop processing dependencies. So we could just set a boolean flag (e.g. disabled) and use that in later checks. This would allow:

Emptying processedUrlSet once we're past the limit. There's no point in keeping it around. We should probablu check if the queue is empty before clearing it, to avoid duplicates.

Print a single message, only the first time it becomes full, rather than on every enqueued dependency. That would make it far less noisy, and then we could increment it from debug to warning level.

Thanks for the review @smola!! I think these are all good ideas. I have updated my implementation to reflect these changes.
WRT "We should probably check if the queue is empty before clearing it, to avoid duplicates." : my new implementation should never reach the duplicate check again (where processedUrlSet is used) once the queuing is disabled so I didn't include an emptiness check for newUrlsQueue as even it is not empty, we won't need to check against the set. Let me know if the implementation looks good or if I may have missed something else.

* added limit for DependencyResolverQueue with a config to set a limit

added limit for DependencyResolverQueue with a config to set a limit

20dc4d0

removed unused import

a9d0b75

smola approved these changes Feb 26, 2024

View reviewed changes

smola reviewed Feb 26, 2024

View reviewed changes

nayeem-kamal added 2 commits February 26, 2024 14:31

implemented suggested improvements

514f4be

minor logic fix for limiting deps

856a99f

nayeem-kamal marked this pull request as ready for review February 26, 2024 20:01

nayeem-kamal requested a review from a team as a code owner February 26, 2024 20:01

nayeem-kamal requested review from dougqh and PerfectSlayer February 26, 2024 20:01

synchronized the hashset to prevent race conditions

2abaf74

nayeem-kamal merged commit c4b9c17 into master Feb 27, 2024
76 of 79 checks passed

nayeem-kamal deleted the nayeem-kamal/limit-telemetry-dependency-queue-size branch February 27, 2024 15:41

github-actions bot added this to the 1.31.0 milestone Feb 27, 2024

jandro996 pushed a commit that referenced this pull request Feb 29, 2024

Limit DependencyResolverQueue size with a config flag (#6729)

ccf8d66

* added limit for DependencyResolverQueue with a config to set a limit

bm1549 added type: bug comp: telemetry Telemetry labels Mar 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limit DependencyResolverQueue size with a config flag #6729

Limit DependencyResolverQueue size with a config flag #6729

nayeem-kamal commented Feb 23, 2024 •

edited

Loading

pr-commenter bot commented Feb 23, 2024 •

edited

Loading

smola left a comment

smola Feb 26, 2024

nayeem-kamal Feb 26, 2024

Limit DependencyResolverQueue size with a config flag #6729

Limit DependencyResolverQueue size with a config flag #6729

Conversation

nayeem-kamal commented Feb 23, 2024 • edited Loading

What Does This Do

Motivation

Additional Notes

pr-commenter bot commented Feb 23, 2024 • edited Loading

Benchmarks

Startup

Parameters

Summary

Load

Parameters

Summary

smola left a comment

Choose a reason for hiding this comment

smola Feb 26, 2024

Choose a reason for hiding this comment

nayeem-kamal Feb 26, 2024

Choose a reason for hiding this comment

nayeem-kamal commented Feb 23, 2024 •

edited

Loading

pr-commenter bot commented Feb 23, 2024 •

edited

Loading