profiler: randomize when execution traces are collected #2401
Conversation
Benchmarks
Benchmark execution time: 2023-12-15 14:52:28
Comparing candidate commit b684ba8 in PR branch.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 39 metrics, 2 unstable metrics.
force-pushed from a04d37d to 5239976
force-pushed from 5239976 to 0c6d7d7
force-pushed from 8441dfc to 928f0b9
LGTM 🙇. I reviewed the code and ran the test cases with -race -count 20.
force-pushed from 928f0b9 to b684ba8
Thanks Felix! Per our discussion, I added a special case to make sure we trace during startup, and updated the test case to reflect that.
LGTM! 🚢 it!
What does this PR do?
Randomize execution trace collection. Each profiling cycle, we record an
execution trace with probability (profiling period) / (trace period). This
way we still maintain the same desired average data rate of one trace every 15
minutes by default.
Inline the shouldTrace function into the one place it should actually be
used. Prior to this commit, shouldTrace was called from two places: once
to decide whether to trace, and once as a double-check right before
starting the trace. The double-check is not particularly helpful, and if
we're making a decision randomly, then checking twice means we'd have to
win the "coin toss" twice to record data. So we inline the logic into a
single place, right before scheduling a trace.
This is tested by doing many "trials" of recording profiles with
different execution trace configurations, seeing how many traces we get,
and making assertions about the number we see based on the expected
probability of tracing. On the one hand, the actual change is fairly
simple so perhaps this level of testing is overkill. We are also
deliberately introducing a "flaky" test. On the other hand, the first
draft of this change had a bug from calling shouldTrace twice, and the
added test catches that quite consistently. The test could be even
stronger but perhaps this is good enough to start.
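The testing idea described above amounts to a binomial bound check, which could be sketched as follows. The helper name `withinBinomialBounds` and the 4-sigma tolerance are assumptions chosen to keep flakiness rare, not the actual test's parameters.

```go
package main

import (
	"fmt"
	"math"
)

// withinBinomialBounds treats each profiling cycle as an independent
// Bernoulli trial with success probability p, and checks that the
// observed trace count lies within 4 standard deviations of the
// binomial mean n*p. A buggy double-check (effective probability p*p)
// lands far outside these bounds and is caught consistently.
func withinBinomialBounds(observed, n int, p float64) bool {
	mean := float64(n) * p
	sigma := math.Sqrt(float64(n) * p * (1 - p))
	return math.Abs(float64(observed)-mean) <= 4*sigma
}

func main() {
	// 1500 cycles with p = 1/15: mean 100, sigma ~9.7,
	// so counts between ~62 and ~138 pass.
	fmt.Println(withinBinomialBounds(100, 1500, 1.0/15)) // true
	fmt.Println(withinBinomialBounds(300, 1500, 1.0/15)) // false
}
```

A 4-sigma window makes a false failure astronomically unlikely per run while still rejecting a p² trace rate (mean ~6.7 here) with near certainty.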
Motivation
We currently record execution traces at a fixed interval. This means
that apps deployed across several instances simultaneously
will have time periods where no instances have execution trace data.
This also biases us against activity that occurs at a frequency
harmonic with the trace collection frequency. To fully address this
we would need to decouple execution trace collection from the normal
profiling cycle. But as a first step, we should give every profiling
cycle a chance of recording data.
Reviewer's Checklist
For Datadog employees:
@DataDog/security-design-and-guidance
Unsure? Have a question? Request a review!