profiler: randomize when execution traces are collected #2401
Conversation
Benchmarks
Benchmark execution time: 2023-12-15 14:52:28
Comparing candidate commit b684ba8 in PR branch.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 39 metrics, 2 unstable metrics.
force-pushed from a04d37d to 5239976
force-pushed from 5239976 to 0c6d7d7
force-pushed from 8441dfc to 928f0b9
LGTM 🙇. I reviewed the code and ran the test cases with -race -count 20.
force-pushed from 928f0b9 to b684ba8
Thanks Felix! Per our discussion, I added a special case to make sure we trace during startup, and updated the test case to reflect that.
LGTM! 🚢 it!
What does this PR do?
Randomize execution trace collection. Each profiling cycle, we record an
execution trace with probability (profiling period) / (trace period). This
way we still maintain the same desired average data rate of one trace every 15
minutes by default.
Inline the shouldTrace function into the one place it should actually be
used. Prior to this commit, shouldTrace was called from two places: once
to decide whether to trace, and once as a double-check right before
starting the trace. The double-check is not particularly helpful, and if
we're making a decision randomly, then checking twice means we'd have to
win the "coin toss" twice to record data. So we inline the logic into a
single place, right before scheduling a trace.
This is tested by doing many "trials" of recording profiles with
different execution trace configurations, seeing how many traces we get,
and making assertions about the number we see based on the expected
probability of tracing. On the one hand, the actual change is fairly
simple so perhaps this level of testing is overkill. We are also
deliberately introducing a "flaky" test. On the other hand, the first
draft of this change had a bug from calling shouldTrace twice, and the
added test catches that quite consistently. The test could be even
stronger but perhaps this is good enough to start.
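The testing idea described above amounts to a binomial bound check, which could be sketched as follows. The helper name `withinBinomialBounds` and the 4-sigma tolerance are assumptions chosen to keep flakiness rare, not the actual test's parameters.

```go
package main

import (
	"fmt"
	"math"
)

// withinBinomialBounds treats each profiling cycle as an independent
// Bernoulli trial with success probability p, and checks that the
// observed trace count lies within 4 standard deviations of the
// binomial mean n*p. A buggy double-check (effective probability p*p)
// lands far outside these bounds and is caught consistently.
func withinBinomialBounds(observed, n int, p float64) bool {
	mean := float64(n) * p
	sigma := math.Sqrt(float64(n) * p * (1 - p))
	return math.Abs(float64(observed)-mean) <= 4*sigma
}

func main() {
	// 1500 cycles with p = 1/15: mean 100, sigma ~9.7,
	// so counts between ~62 and ~138 pass.
	fmt.Println(withinBinomialBounds(100, 1500, 1.0/15)) // true
	fmt.Println(withinBinomialBounds(300, 1500, 1.0/15)) // false
}
```

A 4-sigma window makes a false failure astronomically unlikely per run while still rejecting a p² trace rate (mean ~6.7 here) with near certainty.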
Motivation
We currently record execution traces at a fixed interval. This means
that apps deployed across several instances simultaneously
will have time periods where no instances have execution trace data.
This also biases us against activity that occurs at a frequency
harmonic with the trace collection frequency. To fully address this
we would need to decouple execution trace collection from the normal
profiling cycle. But as a first step, we should give every profiling
cycle a chance of recording data.
Reviewer's Checklist
For Datadog employees:
@DataDog/security-design-and-guidance
Unsure? Have a question? Request a review!