-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-9035] Stop worker on clock failure #3439
[PROF-9035] Stop worker on clock failure #3439
Conversation
…of-9045-alloc-sampling-spikes-within-windows
b5f23a2
to
6234ae2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM
ext/ddtrace_profiling_native_extension/collectors_cpu_and_wall_time_worker.c
Outdated
Show resolved
Hide resolved
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #3439 +/- ##
==========================================
- Coverage 98.23% 98.23% -0.01%
==========================================
Files 1277 1277
Lines 75141 75175 +34
Branches 3544 3547 +3
==========================================
+ Hits 73817 73849 +32
- Misses 1324 1326 +2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more notes.
I must say, after going through this PR it's definitely... a very big change to enable the discrete sampler to say "hey, the clock is broken".
I kinda wonder if instead we should move the responsibility of checking the clock back to the CpuAndWallTimeWorker
. It already needs to check for errors when calling the discrete sampler so it's not like we're "hiding" a lot of complexity away from it.
On the other hand I kinda realize my suggestion may be ~slightly cruel after you spent all this time on this PR (and I did a bit more on top as I was reviewing it -- the thought only kinda occurred to me at the end).
Also, the current mechanism is definitely more flexible in terms of allowing many more error states to be introduced, whereas my suggestion only really solves the issue of the clock error.
I leave the decision up to you ;)
ext/ddtrace_profiling_native_extension/collectors_cpu_and_wall_time_worker.c
Outdated
Show resolved
Hide resolved
ext/ddtrace_profiling_native_extension/collectors_cpu_and_wall_time_worker.c
Outdated
Show resolved
Hide resolved
ext/ddtrace_profiling_native_extension/collectors_cpu_and_wall_time_worker.c
Outdated
Show resolved
Hide resolved
ext/ddtrace_profiling_native_extension/collectors_cpu_and_wall_time_worker.c
Outdated
Show resolved
Hide resolved
ext/ddtrace_profiling_native_extension/collectors_discrete_dynamic_sampler.c
Outdated
Show resolved
Hide resolved
ext/ddtrace_profiling_native_extension/collectors_discrete_dynamic_sampler.c
Outdated
Show resolved
Hide resolved
ext/ddtrace_profiling_native_extension/collectors_discrete_dynamic_sampler.c
Outdated
Show resolved
Hide resolved
ext/ddtrace_profiling_native_extension/collectors_discrete_dynamic_sampler.c
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM
What does this PR do?
When detecting an interaction that involves a clock time of 0 (returned on error), put the worker into a failure state and ultimately lead to it stopping.
Motivation:
We can't trust things if we have non working clocks and better to shut things down than to continue consuming resources.
Additional Notes:
How to test the change?
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance
.Unsure? Have a question? Request a review!