Fixes to trace agent service rates feedback - e.g. MAX TPS support #6628

dougqh · 2024-02-06T21:36:14Z

This change set fixes the MAX TPS support in the tracer.

What Does This Do

Previously, the tracer treated service name and env case-sensitively. It also mishandled the fallback case of an empty service and empty env.

The empty service and empty env entry was parsed correctly, but getSampler returned a static DEFAULT sampler that always used a rate of 1.0. The tests did not expose this because they also used a rate of 1.0.
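
For illustration, the lookup behavior described above can be sketched as follows. This is a minimal sketch and not the code in RateByServiceTraceSampler: the class name `ServiceRatesSketch`, the nested map layout, and the method names are all hypothetical. It only shows the two properties the fix is about: keys are compared after lower-casing, and any miss (unknown service, or known service with an unknown env) returns the fallback rate received from the agent rather than a constant 1.0.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and layout are assumptions, not the tracer's classes.
final class ServiceRatesSketch {
  // Rates keyed by lower-cased service, then by lower-cased env.
  private final Map<String, Map<String, Double>> rates = new HashMap<>();
  // The fallback rate lives in its own field instead of being a map entry.
  private final double fallbackRate;

  ServiceRatesSketch(double fallbackRate) {
    this.fallbackRate = fallbackRate;
  }

  void put(String service, String env, double rate) {
    rates.computeIfAbsent(lower(service), k -> new HashMap<>()).put(lower(env), rate);
  }

  double rateFor(String service, String env) {
    Map<String, Double> byEnv = rates.get(lower(service));
    if (byEnv == null) {
      return fallbackRate; // unknown service: use the fallback, not a fixed 1.0
    }
    Double rate = byEnv.get(lower(env));
    return rate != null ? rate : fallbackRate; // service matched but env did not
  }

  private static String lower(String s) {
    return s == null ? "" : s.toLowerCase(Locale.ROOT);
  }
}
```

With a fallback of 0.2 and a single entry for service "Spock" and env "Test" at 0.8, a lookup for ("SPOCK", "test") yields 0.8, while ("spock", "prod") and ("foo", "bar") both yield 0.2.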

Motivation

Bring the Java tracer into compliance with the trace agent feedback specification.

Jira ticket: APMS-11270

dougqh · 2024-02-06T21:39:35Z

dd-trace-core/src/main/java/datadog/trace/common/sampling/RateByServiceTraceSampler.java

@@ -65,24 +65,37 @@ private <T extends CoreSpan<T>> String getSpanEnv(final T span) {
  public void onResponse(
      final String endpoint, final Map<String, Map<String, Number>> responseJson) {
    final Map<String, Number> newServiceRates = responseJson.get("rate_by_service");
-    if (null != newServiceRates) {


I refactored the code a fair amount.

The biggest change is introducing a fallbackSampler field rather than storing the fallbackSampler in the Map.

dougqh · 2024-02-06T21:40:53Z

dd-trace-core/src/main/java/datadog/trace/common/sampling/RateByServiceTraceSampler.java


    public static EnvAndService fromString(String key) {
      return CACHE.computeIfAbsent(key, PARSE);
    }

-    private final String env;
-    private final String service;
+    private final String lowerEnv;


Renamed so that it is clear that the values have been transformed
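
A rough sketch of what a key type with pre-transformed fields might look like, assuming the keys have the `service:<name>,env:<name>` form seen in the test's response JSON. Only `fromString`, `CACHE`, `PARSE`, and `lowerEnv` appear in the diff; the class name `EnvAndServiceSketch`, the `lowerService` field, `isFallback`, and the parsing logic are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a cached, lower-cased service/env key.
final class EnvAndServiceSketch {
  private static final String SERVICE_PREFIX = "service:";
  private static final String ENV_SEPARATOR = ",env:";

  private static final ConcurrentHashMap<String, EnvAndServiceSketch> CACHE =
      new ConcurrentHashMap<>();
  private static final Function<String, EnvAndServiceSketch> PARSE = EnvAndServiceSketch::parse;

  // The "lower" prefix signals that the stored values are already transformed.
  final String lowerService;
  final String lowerEnv;

  private EnvAndServiceSketch(String service, String env) {
    this.lowerService = service.toLowerCase(Locale.ROOT);
    this.lowerEnv = env.toLowerCase(Locale.ROOT);
  }

  static EnvAndServiceSketch fromString(String key) {
    // Parse each distinct key string once and reuse the result.
    return CACHE.computeIfAbsent(key, PARSE);
  }

  boolean isFallback() {
    return lowerService.isEmpty() && lowerEnv.isEmpty();
  }

  private static EnvAndServiceSketch parse(String key) {
    int sep = key.indexOf(ENV_SEPARATOR);
    if (!key.startsWith(SERVICE_PREFIX) || sep < 0) {
      // Assumption for this sketch only: unrecognized keys become the fallback key.
      return new EnvAndServiceSketch("", "");
    }
    String service = key.substring(SERVICE_PREFIX.length(), sep);
    String env = key.substring(sep + ENV_SEPARATOR.length());
    return new EnvAndServiceSketch(service, env);
  }
}
```

Lower-casing once at parse time means lookups can compare plain strings without repeating the transformation on every sampled span.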

dougqh · 2024-02-06T21:41:33Z

...race-core/src/test/groovy/datadog/trace/common/sampling/RateByServiceTraceSamplerTest.groovy

    expect:
-    serviceSampler.serviceRates.getSampler(RateByServiceTraceSampler.EnvAndService.DEFAULT).sampleRate == expectedRate
+    serviceSampler.serviceRates.getSampler(RateByServiceTraceSampler.EnvAndService.FALLBACK).sampleRate == expectedRate
+    serviceSampler.serviceRates.getSampler("foo", "bar").sampleRate == expectedRate


This is the key check that was missing previously; it would have caught the broken fallback handling.

pr-commenter · 2024-02-06T22:22:22Z

Benchmarks

Startup

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	dougqh/agent-sampling-feedback-fixes
git_commit_date	1707322045	1707343835
git_commit_sha	`a1e2d77`	`c30a574`
release_version	1.30.0-SNAPSHOT~a1e2d778c0	1.30.0-SNAPSHOT~c30a5741e1

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1707346721	1707346721
ci_job_id	428939138	428939138
ci_pipeline_id	27924388	27924388
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module	Agent	Agent
parent	None	None
variant	iast	iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 44 metrics, 10 unstable metrics.

Startup time reports for petclinic

gantt
    title petclinic - global startup overhead: candidate=1.30.0-SNAPSHOT~c30a5741e1, baseline=1.30.0-SNAPSHOT~a1e2d778c0

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.06 s) : 0, 1060124
Total [baseline] (9.305 s) : 0, 9304599
Agent [candidate] (1.066 s) : 0, 1066094
Total [candidate] (9.367 s) : 0, 9367071
section appsec
Agent [baseline] (1.157 s) : 0, 1157240
Total [baseline] (9.479 s) : 0, 9478851
Agent [candidate] (1.163 s) : 0, 1163245
Total [candidate] (9.436 s) : 0, 9436145
section iast
Agent [baseline] (1.181 s) : 0, 1181012
Total [baseline] (9.631 s) : 0, 9630736
Agent [candidate] (1.185 s) : 0, 1184834
Total [candidate] (9.814 s) : 0, 9813813
section profiling
Agent [baseline] (1.272 s) : 0, 1272324
Total [baseline] (9.502 s) : 0, 9502011
Agent [candidate] (1.277 s) : 0, 1277097
Total [candidate] (9.622 s) : 0, 9622403

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.06 s	-
Agent	appsec	1.157 s	97.115 ms (9.2%)
Agent	iast	1.181 s	120.887 ms (11.4%)
Agent	profiling	1.272 s	212.2 ms (20.0%)
Total	tracing	9.305 s	-
Total	appsec	9.479 s	174.251 ms (1.9%)
Total	iast	9.631 s	326.136 ms (3.5%)
Total	profiling	9.502 s	197.412 ms (2.1%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.066 s	-
Agent	appsec	1.163 s	97.151 ms (9.1%)
Agent	iast	1.185 s	118.739 ms (11.1%)
Agent	profiling	1.277 s	211.003 ms (19.8%)
Total	tracing	9.367 s	-
Total	appsec	9.436 s	69.074 ms (0.7%)
Total	iast	9.814 s	446.742 ms (4.8%)
Total	profiling	9.622 s	255.332 ms (2.7%)

gantt
    title petclinic - break down per module: candidate=1.30.0-SNAPSHOT~c30a5741e1, baseline=1.30.0-SNAPSHOT~a1e2d778c0

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (667.756 ms) : 0, 667756
BytebuddyAgent [candidate] (671.304 ms) : 0, 671304
GlobalTracer [baseline] (297.844 ms) : 0, 297844
GlobalTracer [candidate] (300.007 ms) : 0, 300007
AppSec [baseline] (52.008 ms) : 0, 52008
AppSec [candidate] (52.04 ms) : 0, 52040
Remote Config [baseline] (693.88 µs) : 0, 694
Remote Config [candidate] (694.203 µs) : 0, 694
Telemetry [baseline] (7.574 ms) : 0, 7574
Telemetry [candidate] (7.581 ms) : 0, 7581
section appsec
BytebuddyAgent [baseline] (668.145 ms) : 0, 668145
BytebuddyAgent [candidate] (671.489 ms) : 0, 671489
GlobalTracer [baseline] (297.73 ms) : 0, 297730
GlobalTracer [candidate] (299.283 ms) : 0, 299283
AppSec [baseline] (149.729 ms) : 0, 149729
AppSec [candidate] (150.483 ms) : 0, 150483
Remote Config [baseline] (646.605 µs) : 0, 647
Remote Config [candidate] (653.047 µs) : 0, 653
Telemetry [baseline] (6.763 ms) : 0, 6763
Telemetry [candidate] (6.818 ms) : 0, 6818
section iast
BytebuddyAgent [baseline] (776.512 ms) : 0, 776512
BytebuddyAgent [candidate] (779.007 ms) : 0, 779007
GlobalTracer [baseline] (288.038 ms) : 0, 288038
GlobalTracer [candidate] (289.666 ms) : 0, 289666
AppSec [baseline] (52.564 ms) : 0, 52564
AppSec [candidate] (52.545 ms) : 0, 52545
Remote Config [baseline] (1.349 ms) : 0, 1349
Remote Config [candidate] (630.063 µs) : 0, 630
Telemetry [baseline] (6.521 ms) : 0, 6521
Telemetry [candidate] (7.382 ms) : 0, 7382
IAST [baseline] (21.919 ms) : 0, 21919
IAST [candidate] (21.374 ms) : 0, 21374
section profiling
BytebuddyAgent [baseline] (662.664 ms) : 0, 662664
BytebuddyAgent [candidate] (664.817 ms) : 0, 664817
GlobalTracer [baseline] (380.923 ms) : 0, 380923
GlobalTracer [candidate] (382.208 ms) : 0, 382208
AppSec [baseline] (51.823 ms) : 0, 51823
AppSec [candidate] (52.365 ms) : 0, 52365
Remote Config [baseline] (656.791 µs) : 0, 657
Remote Config [candidate] (669.484 µs) : 0, 669
Telemetry [baseline] (8.06 ms) : 0, 8060
Telemetry [candidate] (8.747 ms) : 0, 8747
ProfilingAgent [baseline] (113.823 ms) : 0, 113823
ProfilingAgent [candidate] (113.831 ms) : 0, 113831
Profiling [baseline] (113.847 ms) : 0, 113847
Profiling [candidate] (113.855 ms) : 0, 113855

Load

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
end_time	2024-02-07T22:34:45	2024-02-07T22:53:39
git_branch	master	dougqh/agent-sampling-feedback-fixes
git_commit_date	1707322045	1707343835
git_commit_sha	`a1e2d77`	`c30a574`
release_version	1.30.0-SNAPSHOT~a1e2d778c0	1.30.0-SNAPSHOT~c30a5741e1
start_time	2024-02-07T22:34:32	2024-02-07T22:53:26

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1707346721	1707346721
ci_job_id	428939138	428939138
ci_pipeline_id	27924388	27924388
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant	iast	iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 16 unstable metrics.

Request duration reports for petclinic

gantt
    title petclinic - request duration [CI 0.99] : candidate=1.30.0-SNAPSHOT~c30a5741e1, baseline=1.30.0-SNAPSHOT~a1e2d778c0
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.356 ms) : 1337, 1375
.   : milestone, 1356,
appsec (1.753 ms) : 1728, 1779
.   : milestone, 1753,
iast (1.517 ms) : 1492, 1541
.   : milestone, 1517,
profiling (1.506 ms) : 1481, 1531
.   : milestone, 1506,
tracing (1.505 ms) : 1480, 1530
.   : milestone, 1505,
section candidate
no_agent (1.358 ms) : 1339, 1377
.   : milestone, 1358,
appsec (1.768 ms) : 1743, 1793
.   : milestone, 1768,
iast (1.505 ms) : 1480, 1529
.   : milestone, 1505,
profiling (1.524 ms) : 1497, 1551
.   : milestone, 1524,
tracing (1.51 ms) : 1485, 1535
.   : milestone, 1510,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.356 ms [1.337 ms, 1.375 ms]	-
appsec	1.753 ms [1.728 ms, 1.779 ms]	396.892 µs (29.3%)
iast	1.517 ms [1.492 ms, 1.541 ms]	160.07 µs (11.8%)
profiling	1.506 ms [1.481 ms, 1.531 ms]	149.124 µs (11.0%)
tracing	1.505 ms [1.48 ms, 1.53 ms]	148.683 µs (11.0%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.358 ms [1.339 ms, 1.377 ms]	-
appsec	1.768 ms [1.743 ms, 1.793 ms]	410.237 µs (30.2%)
iast	1.505 ms [1.48 ms, 1.529 ms]	146.602 µs (10.8%)
profiling	1.524 ms [1.497 ms, 1.551 ms]	165.831 µs (12.2%)
tracing	1.51 ms [1.485 ms, 1.535 ms]	151.934 µs (11.2%)

Request duration reports for insecure-bank

gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.30.0-SNAPSHOT~c30a5741e1, baseline=1.30.0-SNAPSHOT~a1e2d778c0
    dateFormat X
    axisFormat %s
section baseline
no_agent (362.985 µs) : 343, 383
.   : milestone, 363,
iast (467.725 µs) : 447, 488
.   : milestone, 468,
iast_FULL (534.042 µs) : 513, 555
.   : milestone, 534,
iast_GLOBAL (500.878 µs) : 480, 522
.   : milestone, 501,
iast_HARDCODED_SECRET_DISABLED (466.864 µs) : 446, 487
.   : milestone, 467,
iast_INACTIVE (444.555 µs) : 423, 466
.   : milestone, 445,
iast_TELEMETRY_OFF (471.142 µs) : 450, 492
.   : milestone, 471,
tracing (436.859 µs) : 416, 458
.   : milestone, 437,
section candidate
no_agent (367.081 µs) : 346, 388
.   : milestone, 367,
iast (473.225 µs) : 452, 494
.   : milestone, 473,
iast_FULL (530.173 µs) : 510, 551
.   : milestone, 530,
iast_GLOBAL (500.35 µs) : 479, 522
.   : milestone, 500,
iast_HARDCODED_SECRET_DISABLED (469.793 µs) : 449, 491
.   : milestone, 470,
iast_INACTIVE (438.99 µs) : 418, 460
.   : milestone, 439,
iast_TELEMETRY_OFF (465.933 µs) : 445, 487
.   : milestone, 466,
tracing (436.1 µs) : 416, 456
.   : milestone, 436,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	362.985 µs [343.2 µs, 382.77 µs]	-
iast	467.725 µs [447.225 µs, 488.225 µs]	104.74 µs (28.9%)
iast_FULL	534.042 µs [513.471 µs, 554.612 µs]	171.057 µs (47.1%)
iast_GLOBAL	500.878 µs [479.555 µs, 522.201 µs]	137.893 µs (38.0%)
iast_HARDCODED_SECRET_DISABLED	466.864 µs [446.397 µs, 487.33 µs]	103.879 µs (28.6%)
iast_INACTIVE	444.555 µs [422.999 µs, 466.111 µs]	81.57 µs (22.5%)
iast_TELEMETRY_OFF	471.142 µs [450.283 µs, 492.0 µs]	108.157 µs (29.8%)
tracing	436.859 µs [416.041 µs, 457.677 µs]	73.874 µs (20.4%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	367.081 µs [345.958 µs, 388.204 µs]	-
iast	473.225 µs [451.973 µs, 494.478 µs]	106.145 µs (28.9%)
iast_FULL	530.173 µs [509.576 µs, 550.769 µs]	163.092 µs (44.4%)
iast_GLOBAL	500.35 µs [478.78 µs, 521.92 µs]	133.269 µs (36.3%)
iast_HARDCODED_SECRET_DISABLED	469.793 µs [449.07 µs, 490.516 µs]	102.712 µs (28.0%)
iast_INACTIVE	438.99 µs [418.166 µs, 459.815 µs]	71.91 µs (19.6%)
iast_TELEMETRY_OFF	465.933 µs [445.253 µs, 486.613 µs]	98.852 µs (26.9%)
tracing	436.1 µs [415.988 µs, 456.212 µs]	69.019 µs (18.8%)

dougqh · 2024-02-07T00:15:58Z

dd-trace-core/src/main/java/datadog/trace/common/sampling/RateByServiceTraceSampler.java

      if (serviceRates == null) {
-        return DEFAULT;


These lines returning DEFAULT are the bugs in the original implementation.

dougqh · 2024-02-07T14:24:28Z

...race-core/src/test/groovy/datadog/trace/common/sampling/RateByServiceTraceSamplerTest.groovy

+
+    where:
+    service | env     | expectedRate
+    "foo"   | "bar"   | 0.8


New tests to verify case insensitivity and to check that partial matches use the fallback.

ygree · 2024-02-07T18:11:52Z

...race-core/src/test/groovy/datadog/trace/common/sampling/RateByServiceTraceSamplerTest.groovy

+    def tracer = tracerBuilder().writer(new ListWriter()).build()
+
+    when:
+    def response = '{"rate_by_service": {"service:spock,env:test":1.0}}'


I'd add one more rule, "service:SPOCK,env:Test":0.0, to make sure we verify case insensitivity. Otherwise this test will pass even if matching is case-sensitive, because of the default case.

Added another commit with that test, plus some others that check case-insensitively equivalent service/env pairs.

ygree

Looks good! Left one proposal for the test to make sure we are testing case-insensitivity and not the default case

Adding tests that cover case-insensitively equivalent service/env pairs

Not defined in the spec, but implemented in the Java tracer as first one wins.
Also added tests to cover partial collisions (e.g. just service or just env), both same case and differing case.
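
To make the "first one wins" rule concrete, here is a minimal sketch, assuming the rates arrive as a map whose iteration order matches the order of the response JSON. The class and method names are hypothetical and this is not the tracer's implementation; the real ordering depends on how the response is parsed.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of resolving case-insensitively equivalent keys.
final class FirstOneWinsSketch {
  static Map<String, Double> normalize(Map<String, ? extends Number> rateByService) {
    Map<String, Double> normalized = new LinkedHashMap<>();
    for (Map.Entry<String, ? extends Number> e : rateByService.entrySet()) {
      // putIfAbsent keeps the first rate seen for each lower-cased key;
      // later entries that differ only by case are ignored.
      normalized.putIfAbsent(
          e.getKey().toLowerCase(Locale.ROOT), e.getValue().doubleValue());
    }
    return normalized;
  }
}
```

Given "service:spock,env:test" at 1.0 followed by "service:SPOCK,env:Test" at 0.0, the normalized map keeps 1.0 for that pair. A key that collides only partially, such as "service:spock,env:prod", remains a separate entry.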

PerfectSlayer

I didn't have time to submit my review before Yury approved it, but I wanted to share that I appreciated the inline comments about the refactoring. Thanks for taking the time to add them 👍

dougqh requested a review from a team as a code owner February 6, 2024 21:36

dougqh requested review from PerfectSlayer and nayeem-kamal February 6, 2024 21:36

dougqh added 2 commits February 6, 2024 16:47

Tweak to test, so we're checking insensitivity of env, too

67932f5

Removing TODO

70eee78

Adding more tests to check that the right rate is chosen

868beb4

Checking case insensitivity
Checking partial match of service & partial match of env
Checking use of fallback

ygree reviewed Feb 7, 2024

ygree approved these changes Feb 7, 2024

dougqh added 2 commits February 7, 2024 17:02

Merge branch 'master' into dougqh/agent-sampling-feedback-fixes

c30a574

PerfectSlayer approved these changes Feb 8, 2024

View reviewed changes

dougqh merged commit f90cc8a into master Feb 8, 2024
79 checks passed

dougqh deleted the dougqh/agent-sampling-feedback-fixes branch February 8, 2024 18:35

github-actions bot added this to the 1.30.0 milestone Feb 8, 2024

bm1549 added type: bug comp: core Tracer core labels Feb 9, 2024

dougqh restored the dougqh/agent-sampling-feedback-fixes branch May 2, 2024 14:49
