Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit DependencyResolverQueue size with a config flag #6729

Merged

Conversation

nayeem-kamal
Copy link
Contributor

@nayeem-kamal nayeem-kamal commented Feb 23, 2024

What Does This Do

Allows the user to set a limit to the size of the telemetry DependencyResolverQueue to prevent OOM errors when there are a large number of dependencies
Introduces a new config variable (dd.telemetry.dependency-resolution.queue.size) which takes an integer to set as the limit for the number of dependencies to store in the queue.

Motivation

Resolve APMS-11445

Additional Notes

Jira ticket: APMS-11445

@pr-commenter
Copy link

pr-commenter bot commented Feb 23, 2024

Benchmarks

Startup

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master nayeem-kamal/limit-telemetry-dependency-queue-size
git_commit_date 1708941753 1708977519
git_commit_sha ab5e982 856a99f
release_version 1.31.0-SNAPSHOT~ab5e98242b 1.31.0-SNAPSHOT~856a99f46a
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1708980580 1708980580
ci_job_id 443705723 443705723
ci_pipeline_id 28959121 28959121
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module Agent Agent
parent None None
variant iast iast

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 41 metrics, 11 unstable metrics.

scenario Δ mean execution_time candidate mean execution_time baseline mean execution_time
scenario:startup:petclinic:profiling:ProfilingAgent worse
[+15.346ms; +22.780ms] or [+16.574%; +24.604%]
111.650ms 92.587ms
scenario:startup:petclinic:profiling:Profiling worse
[+15.344ms; +22.778ms] or [+16.568%; +24.595%]
111.675ms 92.613ms
Startup time reports for petclinic
gantt
    title petclinic - global startup overhead: candidate=1.31.0-SNAPSHOT~856a99f46a, baseline=1.31.0-SNAPSHOT~ab5e98242b

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.077 s) : 0, 1077274
Total [baseline] (9.17 s) : 0, 9169625
Agent [candidate] (1.088 s) : 0, 1088103
Total [candidate] (9.197 s) : 0, 9196849
section appsec
Agent [baseline] (1.175 s) : 0, 1174688
Total [baseline] (9.385 s) : 0, 9384913
Agent [candidate] (1.185 s) : 0, 1185156
Total [candidate] (9.345 s) : 0, 9344991
section iast
Agent [baseline] (1.199 s) : 0, 1199246
Total [baseline] (9.389 s) : 0, 9388985
Agent [candidate] (1.202 s) : 0, 1201852
Total [candidate] (9.334 s) : 0, 9333752
section profiling
Agent [baseline] (1.271 s) : 0, 1271451
Total [baseline] (9.333 s) : 0, 9332762
Agent [candidate] (1.299 s) : 0, 1298551
Total [candidate] (9.355 s) : 0, 9355496
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.077 s -
Agent appsec 1.175 s 97.414 ms (9.0%)
Agent iast 1.199 s 121.973 ms (11.3%)
Agent profiling 1.271 s 194.177 ms (18.0%)
Total tracing 9.17 s -
Total appsec 9.385 s 215.288 ms (2.3%)
Total iast 9.389 s 219.36 ms (2.4%)
Total profiling 9.333 s 163.137 ms (1.8%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.088 s -
Agent appsec 1.185 s 97.053 ms (8.9%)
Agent iast 1.202 s 113.748 ms (10.5%)
Agent profiling 1.299 s 210.448 ms (19.3%)
Total tracing 9.197 s -
Total appsec 9.345 s 148.142 ms (1.6%)
Total iast 9.334 s 136.903 ms (1.5%)
Total profiling 9.355 s 158.647 ms (1.7%)
gantt
    title petclinic - break down per module: candidate=1.31.0-SNAPSHOT~856a99f46a, baseline=1.31.0-SNAPSHOT~ab5e98242b

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (685.796 ms) : 0, 685796
BytebuddyAgent [candidate] (692.747 ms) : 0, 692747
GlobalTracer [baseline] (297.426 ms) : 0, 297426
GlobalTracer [candidate] (300.432 ms) : 0, 300432
AppSec [baseline] (51.349 ms) : 0, 51349
AppSec [candidate] (51.633 ms) : 0, 51633
Remote Config [baseline] (707.729 µs) : 0, 708
Remote Config [candidate] (724.531 µs) : 0, 725
Telemetry [baseline] (7.692 ms) : 0, 7692
Telemetry [candidate] (7.928 ms) : 0, 7928
section appsec
BytebuddyAgent [baseline] (684.8 ms) : 0, 684800
BytebuddyAgent [candidate] (691.054 ms) : 0, 691054
GlobalTracer [baseline] (297.007 ms) : 0, 297007
GlobalTracer [candidate] (300.314 ms) : 0, 300314
AppSec [baseline] (151.196 ms) : 0, 151196
AppSec [candidate] (151.619 ms) : 0, 151619
Remote Config [baseline] (631.946 µs) : 0, 632
Remote Config [candidate] (640.977 µs) : 0, 641
Telemetry [baseline] (6.883 ms) : 0, 6883
Telemetry [candidate] (7.006 ms) : 0, 7006
section iast
BytebuddyAgent [baseline] (794.416 ms) : 0, 794416
BytebuddyAgent [candidate] (796.105 ms) : 0, 796105
GlobalTracer [baseline] (287.899 ms) : 0, 287899
GlobalTracer [candidate] (288.12 ms) : 0, 288120
AppSec [baseline] (54.048 ms) : 0, 54048
AppSec [candidate] (56.615 ms) : 0, 56615
Remote Config [baseline] (615.944 µs) : 0, 616
Remote Config [candidate] (612.252 µs) : 0, 612
Telemetry [baseline] (6.586 ms) : 0, 6586
Telemetry [candidate] (6.575 ms) : 0, 6575
IAST [baseline] (21.396 ms) : 0, 21396
IAST [candidate] (19.603 ms) : 0, 19603
section profiling
ProfilingAgent [baseline] (92.587 ms) : 0, 92587
ProfilingAgent [candidate] (111.65 ms) : 0, 111650
BytebuddyAgent [baseline] (678.333 ms) : 0, 678333
BytebuddyAgent [candidate] (683.192 ms) : 0, 683192
GlobalTracer [baseline] (379.858 ms) : 0, 379858
GlobalTracer [candidate] (382.641 ms) : 0, 382641
AppSec [baseline] (53.026 ms) : 0, 53026
AppSec [candidate] (52.992 ms) : 0, 52992
Remote Config [baseline] (829.639 µs) : 0, 830
Remote Config [candidate] (800.094 µs) : 0, 800
Telemetry [baseline] (10.788 ms) : 0, 10788
Telemetry [candidate] (10.904 ms) : 0, 10904
Profiling [baseline] (92.613 ms) : 0, 92613
Profiling [candidate] (111.675 ms) : 0, 111675

Load

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
end_time 2024-02-26T20:26:07 2024-02-26T20:44:41
git_branch master nayeem-kamal/limit-telemetry-dependency-queue-size
git_commit_date 1708941753 1708977519
git_commit_sha ab5e982 856a99f
release_version 1.31.0-SNAPSHOT~ab5e98242b 1.31.0-SNAPSHOT~856a99f46a
start_time 2024-02-26T20:25:54 2024-02-26T20:44:28
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1708980580 1708980580
ci_job_id 443705723 443705723
ci_pipeline_id 28959121 28959121
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant iast iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 16 unstable metrics.

Request duration reports for petclinic
gantt
    title petclinic - request duration [CI 0.99] : candidate=1.31.0-SNAPSHOT~856a99f46a, baseline=1.31.0-SNAPSHOT~ab5e98242b
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.347 ms) : 1328, 1366
.   : milestone, 1347,
appsec (1.789 ms) : 1764, 1813
.   : milestone, 1789,
iast (1.526 ms) : 1503, 1549
.   : milestone, 1526,
profiling (1.494 ms) : 1471, 1518
.   : milestone, 1494,
tracing (1.508 ms) : 1485, 1531
.   : milestone, 1508,
section candidate
no_agent (1.348 ms) : 1329, 1367
.   : milestone, 1348,
appsec (1.773 ms) : 1750, 1796
.   : milestone, 1773,
iast (1.526 ms) : 1502, 1549
.   : milestone, 1526,
profiling (1.523 ms) : 1499, 1547
.   : milestone, 1523,
tracing (1.483 ms) : 1460, 1506
.   : milestone, 1483,
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.347 ms [1.328 ms, 1.366 ms] -
appsec 1.789 ms [1.764 ms, 1.813 ms] 441.668 µs (32.8%)
iast 1.526 ms [1.503 ms, 1.549 ms] 179.082 µs (13.3%)
profiling 1.494 ms [1.471 ms, 1.518 ms] 147.199 µs (10.9%)
tracing 1.508 ms [1.485 ms, 1.531 ms] 161.068 µs (12.0%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.348 ms [1.329 ms, 1.367 ms] -
appsec 1.773 ms [1.75 ms, 1.796 ms] 425.135 µs (31.5%)
iast 1.526 ms [1.502 ms, 1.549 ms] 177.891 µs (13.2%)
profiling 1.523 ms [1.499 ms, 1.547 ms] 175.179 µs (13.0%)
tracing 1.483 ms [1.46 ms, 1.506 ms] 135.114 µs (10.0%)
Request duration reports for insecure-bank
gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.31.0-SNAPSHOT~856a99f46a, baseline=1.31.0-SNAPSHOT~ab5e98242b
    dateFormat X
    axisFormat %s
section baseline
no_agent (365.191 µs) : 345, 385
.   : milestone, 365,
iast (473.753 µs) : 453, 495
.   : milestone, 474,
iast_FULL (534.851 µs) : 515, 555
.   : milestone, 535,
iast_GLOBAL (499.551 µs) : 478, 521
.   : milestone, 500,
iast_HARDCODED_SECRET_DISABLED (474.118 µs) : 454, 495
.   : milestone, 474,
iast_INACTIVE (436.618 µs) : 416, 457
.   : milestone, 437,
iast_TELEMETRY_OFF (471.972 µs) : 451, 493
.   : milestone, 472,
tracing (440.353 µs) : 419, 461
.   : milestone, 440,
section candidate
no_agent (356.5 µs) : 337, 376
.   : milestone, 356,
iast (473.493 µs) : 453, 494
.   : milestone, 473,
iast_FULL (532.675 µs) : 512, 553
.   : milestone, 533,
iast_GLOBAL (490.575 µs) : 470, 511
.   : milestone, 491,
iast_HARDCODED_SECRET_DISABLED (468.465 µs) : 448, 489
.   : milestone, 468,
iast_INACTIVE (446.822 µs) : 426, 468
.   : milestone, 447,
iast_TELEMETRY_OFF (464.644 µs) : 444, 485
.   : milestone, 465,
tracing (443.504 µs) : 423, 464
.   : milestone, 444,
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 365.191 µs [345.227 µs, 385.156 µs] -
iast 473.753 µs [452.767 µs, 494.738 µs] 108.561 µs (29.7%)
iast_FULL 534.851 µs [514.501 µs, 555.2 µs] 169.659 µs (46.5%)
iast_GLOBAL 499.551 µs [478.093 µs, 521.01 µs] 134.36 µs (36.8%)
iast_HARDCODED_SECRET_DISABLED 474.118 µs [453.735 µs, 494.501 µs] 108.927 µs (29.8%)
iast_INACTIVE 436.618 µs [416.468 µs, 456.769 µs] 71.427 µs (19.6%)
iast_TELEMETRY_OFF 471.972 µs [450.882 µs, 493.062 µs] 106.781 µs (29.2%)
tracing 440.353 µs [419.463 µs, 461.244 µs] 75.162 µs (20.6%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 356.5 µs [336.804 µs, 376.195 µs] -
iast 473.493 µs [452.74 µs, 494.246 µs] 116.993 µs (32.8%)
iast_FULL 532.675 µs [511.937 µs, 553.414 µs] 176.175 µs (49.4%)
iast_GLOBAL 490.575 µs [470.429 µs, 510.72 µs] 134.075 µs (37.6%)
iast_HARDCODED_SECRET_DISABLED 468.465 µs [447.588 µs, 489.341 µs] 111.965 µs (31.4%)
iast_INACTIVE 446.822 µs [426.092 µs, 467.552 µs] 90.322 µs (25.3%)
iast_TELEMETRY_OFF 464.644 µs [443.983 µs, 485.305 µs] 108.144 µs (30.3%)
tracing 443.504 µs [422.777 µs, 464.23 µs] 87.004 µs (24.4%)

Copy link
Member

@smola smola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably set some large but not MAX_INTEGER by default.

public void queueURI(URI uri) {
if (uri == null) {
return;
}

if (processedUrlsSet.size() >= MAX_QUEUE_SIZE || newUrlsQueue.size() >= MAX_QUEUE_SIZE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we're past this threshold, we'll stop processing dependencies. So we could just set a boolean flag (e.g. disabled) and use that in later checks. This would allow:

  • Emptying processedUrlSet once we're past the limit. There's no point in keeping it around. We should probablu check if the queue is empty before clearing it, to avoid duplicates.
  • Print a single message, only the first time it becomes full, rather than on every enqueued dependency. That would make it far less noisy, and then we could increment it from debug to warning level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @smola!! I think these are all good ideas. I have updated my implementation to reflect these changes.
WRT "We should probably check if the queue is empty before clearing it, to avoid duplicates." : my new implementation should never reach the duplicate check again (where processedUrlSet is used) once the queuing is disabled so I didn't include an emptiness check for newUrlsQueue as even it is not empty, we won't need to check against the set. Let me know if the implementation looks good or if I may have missed something else.

@nayeem-kamal nayeem-kamal marked this pull request as ready for review February 26, 2024 20:01
@nayeem-kamal nayeem-kamal requested a review from a team as a code owner February 26, 2024 20:01
@nayeem-kamal nayeem-kamal merged commit c4b9c17 into master Feb 27, 2024
76 of 79 checks passed
@nayeem-kamal nayeem-kamal deleted the nayeem-kamal/limit-telemetry-dependency-queue-size branch February 27, 2024 15:41
@github-actions github-actions bot added this to the 1.31.0 milestone Feb 27, 2024
jandro996 pushed a commit that referenced this pull request Feb 29, 2024
* added limit for DependencyResolverQueue with a config to set a limit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants