Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to trace agent service rates feedback - e.g. MAX TPS support #6628

Merged
merged 6 commits into from
Feb 8, 2024

Conversation

dougqh
Copy link
Contributor

@dougqh dougqh commented Feb 6, 2024

This change set fixes the MAX TPS support in the tracer.

What Does This Do

Previously, the tracer treated service name and env in a case sensitive fashion. Additionally, the tracer didn't properly handle the fallback case of an empty service and empty env.

While the empty service and empty env case was parsed properly, getSampler would return a static DEFAULT sampler that always used a rate 1.0. This wasn't obvious from the tests because the tests also use a rate of 1.0.

Motivation

Bring Java tracer into compliance with trace agent feedback specification

Jira ticket: APMS-11270

This change set fixes the MAX TPS support in the tracer.
Previously, the tracer treated service name and env in a case sensitive fashion.
Additionally, the tracer didn't properly handle the fallback case of an empty service and empty env.

While the empty service and empty env case was parsed properly, getSampler would return a static DEFAULT sampler that always used a rate 1.0.  This wasn't obvious from the tests because the tests also use a rate of 1.0.
@dougqh dougqh requested a review from a team as a code owner February 6, 2024 21:36
@@ -65,24 +65,37 @@ private <T extends CoreSpan<T>> String getSpanEnv(final T span) {
public void onResponse(
final String endpoint, final Map<String, Map<String, Number>> responseJson) {
final Map<String, Number> newServiceRates = responseJson.get("rate_by_service");
if (null != newServiceRates) {
Copy link
Contributor Author

@dougqh dougqh Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the code a fair amount.

The biggest change is introducing a fallbackSampler field rather than storing the fallbackSampler in the Map.


public static EnvAndService fromString(String key) {
return CACHE.computeIfAbsent(key, PARSE);
}

private final String env;
private final String service;
private final String lowerEnv;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed so that it is clear that the values have been transformed

expect:
serviceSampler.serviceRates.getSampler(RateByServiceTraceSampler.EnvAndService.DEFAULT).sampleRate == expectedRate
serviceSampler.serviceRates.getSampler(RateByServiceTraceSampler.EnvAndService.FALLBACK).sampleRate == expectedRate
serviceSampler.serviceRates.getSampler("foo", "bar").sampleRate == expectedRate
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key check that was missing previously -- that would have caught the broken fallback handling

@pr-commenter
Copy link

pr-commenter bot commented Feb 6, 2024

Benchmarks

Startup

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master dougqh/agent-sampling-feedback-fixes
git_commit_date 1707322045 1707343835
git_commit_sha a1e2d77 c30a574
release_version 1.30.0-SNAPSHOT~a1e2d778c0 1.30.0-SNAPSHOT~c30a5741e1
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1707346721 1707346721
ci_job_id 428939138 428939138
ci_pipeline_id 27924388 27924388
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module Agent Agent
parent None None
variant iast iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 44 metrics, 10 unstable metrics.

Startup time reports for petclinic
gantt
    title petclinic - global startup overhead: candidate=1.30.0-SNAPSHOT~c30a5741e1, baseline=1.30.0-SNAPSHOT~a1e2d778c0

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.06 s) : 0, 1060124
Total [baseline] (9.305 s) : 0, 9304599
Agent [candidate] (1.066 s) : 0, 1066094
Total [candidate] (9.367 s) : 0, 9367071
section appsec
Agent [baseline] (1.157 s) : 0, 1157240
Total [baseline] (9.479 s) : 0, 9478851
Agent [candidate] (1.163 s) : 0, 1163245
Total [candidate] (9.436 s) : 0, 9436145
section iast
Agent [baseline] (1.181 s) : 0, 1181012
Total [baseline] (9.631 s) : 0, 9630736
Agent [candidate] (1.185 s) : 0, 1184834
Total [candidate] (9.814 s) : 0, 9813813
section profiling
Agent [baseline] (1.272 s) : 0, 1272324
Total [baseline] (9.502 s) : 0, 9502011
Agent [candidate] (1.277 s) : 0, 1277097
Total [candidate] (9.622 s) : 0, 9622403
Loading
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.06 s -
Agent appsec 1.157 s 97.115 ms (9.2%)
Agent iast 1.181 s 120.887 ms (11.4%)
Agent profiling 1.272 s 212.2 ms (20.0%)
Total tracing 9.305 s -
Total appsec 9.479 s 174.251 ms (1.9%)
Total iast 9.631 s 326.136 ms (3.5%)
Total profiling 9.502 s 197.412 ms (2.1%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.066 s -
Agent appsec 1.163 s 97.151 ms (9.1%)
Agent iast 1.185 s 118.739 ms (11.1%)
Agent profiling 1.277 s 211.003 ms (19.8%)
Total tracing 9.367 s -
Total appsec 9.436 s 69.074 ms (0.7%)
Total iast 9.814 s 446.742 ms (4.8%)
Total profiling 9.622 s 255.332 ms (2.7%)
gantt
    title petclinic - break down per module: candidate=1.30.0-SNAPSHOT~c30a5741e1, baseline=1.30.0-SNAPSHOT~a1e2d778c0

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (667.756 ms) : 0, 667756
BytebuddyAgent [candidate] (671.304 ms) : 0, 671304
GlobalTracer [baseline] (297.844 ms) : 0, 297844
GlobalTracer [candidate] (300.007 ms) : 0, 300007
AppSec [baseline] (52.008 ms) : 0, 52008
AppSec [candidate] (52.04 ms) : 0, 52040
Remote Config [baseline] (693.88 µs) : 0, 694
Remote Config [candidate] (694.203 µs) : 0, 694
Telemetry [baseline] (7.574 ms) : 0, 7574
Telemetry [candidate] (7.581 ms) : 0, 7581
section appsec
BytebuddyAgent [baseline] (668.145 ms) : 0, 668145
BytebuddyAgent [candidate] (671.489 ms) : 0, 671489
GlobalTracer [baseline] (297.73 ms) : 0, 297730
GlobalTracer [candidate] (299.283 ms) : 0, 299283
AppSec [baseline] (149.729 ms) : 0, 149729
AppSec [candidate] (150.483 ms) : 0, 150483
Remote Config [baseline] (646.605 µs) : 0, 647
Remote Config [candidate] (653.047 µs) : 0, 653
Telemetry [baseline] (6.763 ms) : 0, 6763
Telemetry [candidate] (6.818 ms) : 0, 6818
section iast
BytebuddyAgent [baseline] (776.512 ms) : 0, 776512
BytebuddyAgent [candidate] (779.007 ms) : 0, 779007
GlobalTracer [baseline] (288.038 ms) : 0, 288038
GlobalTracer [candidate] (289.666 ms) : 0, 289666
AppSec [baseline] (52.564 ms) : 0, 52564
AppSec [candidate] (52.545 ms) : 0, 52545
Remote Config [baseline] (1.349 ms) : 0, 1349
Remote Config [candidate] (630.063 µs) : 0, 630
Telemetry [baseline] (6.521 ms) : 0, 6521
Telemetry [candidate] (7.382 ms) : 0, 7382
IAST [baseline] (21.919 ms) : 0, 21919
IAST [candidate] (21.374 ms) : 0, 21374
section profiling
BytebuddyAgent [baseline] (662.664 ms) : 0, 662664
BytebuddyAgent [candidate] (664.817 ms) : 0, 664817
GlobalTracer [baseline] (380.923 ms) : 0, 380923
GlobalTracer [candidate] (382.208 ms) : 0, 382208
AppSec [baseline] (51.823 ms) : 0, 51823
AppSec [candidate] (52.365 ms) : 0, 52365
Remote Config [baseline] (656.791 µs) : 0, 657
Remote Config [candidate] (669.484 µs) : 0, 669
Telemetry [baseline] (8.06 ms) : 0, 8060
Telemetry [candidate] (8.747 ms) : 0, 8747
ProfilingAgent [baseline] (113.823 ms) : 0, 113823
ProfilingAgent [candidate] (113.831 ms) : 0, 113831
Profiling [baseline] (113.847 ms) : 0, 113847
Profiling [candidate] (113.855 ms) : 0, 113855
Loading

Load

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
end_time 2024-02-07T22:34:45 2024-02-07T22:53:39
git_branch master dougqh/agent-sampling-feedback-fixes
git_commit_date 1707322045 1707343835
git_commit_sha a1e2d77 c30a574
release_version 1.30.0-SNAPSHOT~a1e2d778c0 1.30.0-SNAPSHOT~c30a5741e1
start_time 2024-02-07T22:34:32 2024-02-07T22:53:26
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1707346721 1707346721
ci_job_id 428939138 428939138
ci_pipeline_id 27924388 27924388
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant iast iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 16 unstable metrics.

Request duration reports for petclinic
gantt
    title petclinic - request duration [CI 0.99] : candidate=1.30.0-SNAPSHOT~c30a5741e1, baseline=1.30.0-SNAPSHOT~a1e2d778c0
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.356 ms) : 1337, 1375
.   : milestone, 1356,
appsec (1.753 ms) : 1728, 1779
.   : milestone, 1753,
iast (1.517 ms) : 1492, 1541
.   : milestone, 1517,
profiling (1.506 ms) : 1481, 1531
.   : milestone, 1506,
tracing (1.505 ms) : 1480, 1530
.   : milestone, 1505,
section candidate
no_agent (1.358 ms) : 1339, 1377
.   : milestone, 1358,
appsec (1.768 ms) : 1743, 1793
.   : milestone, 1768,
iast (1.505 ms) : 1480, 1529
.   : milestone, 1505,
profiling (1.524 ms) : 1497, 1551
.   : milestone, 1524,
tracing (1.51 ms) : 1485, 1535
.   : milestone, 1510,
Loading
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.356 ms [1.337 ms, 1.375 ms] -
appsec 1.753 ms [1.728 ms, 1.779 ms] 396.892 µs (29.3%)
iast 1.517 ms [1.492 ms, 1.541 ms] 160.07 µs (11.8%)
profiling 1.506 ms [1.481 ms, 1.531 ms] 149.124 µs (11.0%)
tracing 1.505 ms [1.48 ms, 1.53 ms] 148.683 µs (11.0%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.358 ms [1.339 ms, 1.377 ms] -
appsec 1.768 ms [1.743 ms, 1.793 ms] 410.237 µs (30.2%)
iast 1.505 ms [1.48 ms, 1.529 ms] 146.602 µs (10.8%)
profiling 1.524 ms [1.497 ms, 1.551 ms] 165.831 µs (12.2%)
tracing 1.51 ms [1.485 ms, 1.535 ms] 151.934 µs (11.2%)
Request duration reports for insecure-bank
gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.30.0-SNAPSHOT~c30a5741e1, baseline=1.30.0-SNAPSHOT~a1e2d778c0
    dateFormat X
    axisFormat %s
section baseline
no_agent (362.985 µs) : 343, 383
.   : milestone, 363,
iast (467.725 µs) : 447, 488
.   : milestone, 468,
iast_FULL (534.042 µs) : 513, 555
.   : milestone, 534,
iast_GLOBAL (500.878 µs) : 480, 522
.   : milestone, 501,
iast_HARDCODED_SECRET_DISABLED (466.864 µs) : 446, 487
.   : milestone, 467,
iast_INACTIVE (444.555 µs) : 423, 466
.   : milestone, 445,
iast_TELEMETRY_OFF (471.142 µs) : 450, 492
.   : milestone, 471,
tracing (436.859 µs) : 416, 458
.   : milestone, 437,
section candidate
no_agent (367.081 µs) : 346, 388
.   : milestone, 367,
iast (473.225 µs) : 452, 494
.   : milestone, 473,
iast_FULL (530.173 µs) : 510, 551
.   : milestone, 530,
iast_GLOBAL (500.35 µs) : 479, 522
.   : milestone, 500,
iast_HARDCODED_SECRET_DISABLED (469.793 µs) : 449, 491
.   : milestone, 470,
iast_INACTIVE (438.99 µs) : 418, 460
.   : milestone, 439,
iast_TELEMETRY_OFF (465.933 µs) : 445, 487
.   : milestone, 466,
tracing (436.1 µs) : 416, 456
.   : milestone, 436,
Loading
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 362.985 µs [343.2 µs, 382.77 µs] -
iast 467.725 µs [447.225 µs, 488.225 µs] 104.74 µs (28.9%)
iast_FULL 534.042 µs [513.471 µs, 554.612 µs] 171.057 µs (47.1%)
iast_GLOBAL 500.878 µs [479.555 µs, 522.201 µs] 137.893 µs (38.0%)
iast_HARDCODED_SECRET_DISABLED 466.864 µs [446.397 µs, 487.33 µs] 103.879 µs (28.6%)
iast_INACTIVE 444.555 µs [422.999 µs, 466.111 µs] 81.57 µs (22.5%)
iast_TELEMETRY_OFF 471.142 µs [450.283 µs, 492.0 µs] 108.157 µs (29.8%)
tracing 436.859 µs [416.041 µs, 457.677 µs] 73.874 µs (20.4%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 367.081 µs [345.958 µs, 388.204 µs] -
iast 473.225 µs [451.973 µs, 494.478 µs] 106.145 µs (28.9%)
iast_FULL 530.173 µs [509.576 µs, 550.769 µs] 163.092 µs (44.4%)
iast_GLOBAL 500.35 µs [478.78 µs, 521.92 µs] 133.269 µs (36.3%)
iast_HARDCODED_SECRET_DISABLED 469.793 µs [449.07 µs, 490.516 µs] 102.712 µs (28.0%)
iast_INACTIVE 438.99 µs [418.166 µs, 459.815 µs] 71.91 µs (19.6%)
iast_TELEMETRY_OFF 465.933 µs [445.253 µs, 486.613 µs] 98.852 µs (26.9%)
tracing 436.1 µs [415.988 µs, 456.212 µs] 69.019 µs (18.8%)

if (serviceRates == null) {
return DEFAULT;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines returning DEFAULT are the bugs in the original implementation.

Checking case insensitivity
Checking partial match of service & partial match of env
Checking use of fallback

where:
service | env | expectedRate
"foo" | "bar" | 0.8
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New tests to verify case insensitivity
And check that partial matches uses the fallback

def tracer = tracerBuilder().writer(new ListWriter()).build()

when:
def response = '{"rate_by_service": {"service:spock,env:test":1.0}}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add one more rule ,"service:SPOCK,env:Test":0.0 to make sure we verify case-insensitivity otherwise this test will pass even if it's case-sensitive b/o the default case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another commit with that test in addition to some others that check case insensitive equivalent service/env pairs

Copy link
Contributor

@ygree ygree left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left one proposal for the test to make sure we are testing case-insensitivity and not the default case

Adding tests that cover case insensitive equivalent service/env pairs
Not defined in spec, but in Java tracer implemented as first one wins

Also added tests to cover partial collisions (e.g. just service or just env) -- both same case and differing case
Copy link
Contributor

@PerfectSlayer PerfectSlayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't had the time to submit my review before Yury approved it but I wanted to share I appreciated the inline comments about the refactoring part, thanks for taking time to add them 👍

@dougqh dougqh merged commit f90cc8a into master Feb 8, 2024
79 checks passed
@dougqh dougqh deleted the dougqh/agent-sampling-feedback-fixes branch February 8, 2024 18:35
@github-actions github-actions bot added this to the 1.30.0 milestone Feb 8, 2024
@dougqh dougqh restored the dougqh/agent-sampling-feedback-fixes branch May 2, 2024 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants