Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

profiler: randomize when execution traces are collected #2401

Merged
merged 2 commits into from Dec 16, 2023

Conversation

nsrip-dd
Copy link
Contributor

@nsrip-dd nsrip-dd commented Nov 29, 2023

What does this PR do?

Randomize execution trace collection. Each profiling cycle we record an
execution trace with probability (profiling period) / (trace period). This
way we still maintain the same desired avarage data rate of one trace every 15
minutes by default.

Inline the shouldTrace function into the one place it should actually be
used. Prior to this commit, shouldTrace was called from two places: once
to decide whether to trace, and once as a double-check right before
starting the trace. The double-check is not particularly helpful, and if
we're making a decision randomly, then checking twice means we'd have to
win the "coin toss" twice to record data. So we inline the logic into a
single place, right before scheduling a trace.

This is tested by doing many "trials" of recording profiles with
different execution trace configurations, seeing how many traces we get,
and making assertions about the number we see based on the expected
probability of tracing. On the one hand, the actual change is fairly
simple so perhaps this level of testing is overkill. We also are
deliberatly introducing a "flaky" test. On the other hand, the first
draft of this change had a bug from calling shouldTrace twice, and the
added test catches that quite consistently. The test could be even
stronger but perhaps this is good enough to start.

Motivation

We currently record execution traces at a fixed interval. This means
that apps which are deployed across several instances simulatenously
will have time periods where no instances have execution trace data.
This also biases us against activity which occurs with a frequency which
is harmonic with the trace collection frequency. To fully address this
we would need to decouple execution trace collection from the normal
profiling cycle. But as a first step, we should give every profiling
cycle a chance of recording data.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.

For Datadog employees:

  • If this PR touches code that handles credentials of any kind, such as Datadog API keys, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@pr-commenter
Copy link

pr-commenter bot commented Nov 29, 2023

Benchmarks

Benchmark execution time: 2023-12-15 14:52:28

Comparing candidate commit b684ba8 in PR branch PROF-8728-randomize-execution-trace-collection with baseline commit 5a92f7b in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 39 metrics, 2 unstable metrics.

profiler/profiler.go Outdated Show resolved Hide resolved
@nsrip-dd nsrip-dd force-pushed the PROF-8728-randomize-execution-trace-collection branch from a04d37d to 5239976 Compare December 1, 2023 15:46
@nsrip-dd nsrip-dd force-pushed the PROF-8728-randomize-execution-trace-collection branch from 5239976 to 0c6d7d7 Compare December 13, 2023 19:04
@nsrip-dd nsrip-dd marked this pull request as ready for review December 13, 2023 19:32
@nsrip-dd nsrip-dd requested a review from a team as a code owner December 13, 2023 19:32
@nsrip-dd nsrip-dd force-pushed the PROF-8728-randomize-execution-trace-collection branch 2 times, most recently from 8441dfc to 928f0b9 Compare December 14, 2023 16:45
Copy link
Member

@felixge felixge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🙇 . I reviewed the code and ran the test cases with -race -count 20.

profiler/profiler_test.go Show resolved Hide resolved
profiler/profiler_test.go Outdated Show resolved Hide resolved
profiler/profiler_test.go Outdated Show resolved Hide resolved
We currently record execution traces at a fixed interval. This means
that apps which are deployed across several instances simulatenously
will have time periods where no instances have execution trace data.
This also biases us against activity which occurs with a frequency which
is harmonic with the trace collection frequency. To fully address this
we would need to decouple execution trace collection from the normal
profiling cycle. But as a first step, we should give every profiling
cycle a chance of recording data. This commit does that: each profiling
cycle we record an execution trace with probability (profiling period) /
(trace period). This way we still maintain the same desired avarage data
rate of ~one trace every 15 minutes by default. We have one special
case, though: we want a trace for the first profile cycle to capture
startup activity, since that's usually quite different than normal
program activity.

Inline the shouldTrace function into the one place it should actually be
used. Prior to this commit, shouldTrace was called from two places: once
to decide whether to trace, and once as a double-check right before
starting the trace. The double-check is not particularly helpful, and if
we're making a decision randomly, then checking twice means we'd have to
win the "coin toss" twice to record data. So we inline the logic into a
single place, right before scheduling a trace.

This is tested by doing many "trials" of recording profiles with
different execution trace configurations, seeing how many traces we get,
and making assertions about the number we see based on the expected
probability of tracing. On the one hand, the actual change is fairly
simple so perhaps this level of testing is overkill. We also are
deliberatly introducing a "flaky" test. On the other hand, the first
draft of this change had a bug from calling shouldTrace twice, and the
added test catches that quite consistently. The test could be even
stronger but perhaps this is good enough to start.
@nsrip-dd nsrip-dd force-pushed the PROF-8728-randomize-execution-trace-collection branch from 928f0b9 to b684ba8 Compare December 15, 2023 14:37
Copy link
Contributor Author

@nsrip-dd nsrip-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Felix! Per our discussion, I added a special case to make sure we trace during startup, and updated the test case to reflect that.

profiler/profiler_test.go Show resolved Hide resolved
profiler/profiler_test.go Outdated Show resolved Hide resolved
profiler/profiler_test.go Outdated Show resolved Hide resolved
Copy link
Member

@felixge felixge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚢 it!

@nsrip-dd nsrip-dd enabled auto-merge (squash) December 16, 2023 00:04
@nsrip-dd nsrip-dd merged commit e44bd72 into main Dec 16, 2023
62 checks passed
@nsrip-dd nsrip-dd deleted the PROF-8728-randomize-execution-trace-collection branch December 16, 2023 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants