[PROF-9035] Stop worker on clock failure #3439

Merged

AlexJF
Contributor

@AlexJF AlexJF commented Feb 6, 2024

What does this PR do?
When an interaction involves a clock time of 0 (the value returned on error), put the worker into a failure state, which ultimately leads to it stopping.
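
To illustrate the idea, here is a minimal Ruby sketch of a sampler that treats a clock reading of 0 as an error and a worker that reacts by entering a failure state and stopping. It is not the code from this PR: `SketchClockError`, `SketchSampler`, `SketchWorker`, and the injected `clock` callable are all hypothetical names made up for the example, and the real change may well be structured differently.

```ruby
# Hypothetical sketch; none of these names come from the actual codebase.
class SketchClockError < StandardError; end

# Stand-in for a sampler that needs the current time for every decision.
class SketchSampler
  # clock: a callable returning monotonic nanoseconds, or 0 on error (assumption)
  def initialize(clock)
    @clock = clock
  end

  def should_sample?
    now_ns = @clock.call
    # A reading of 0 means the clock call failed, so nothing can be trusted.
    raise SketchClockError, "clock returned 0" if now_ns.zero?
    true
  end
end

# Stand-in for the worker that drives the sampler.
class SketchWorker
  attr_reader :failure

  def initialize(sampler)
    @sampler = sampler
    @failure = nil
    @running = true
  end

  def running?
    @running
  end

  # Returns true if a sample was taken. On a clock error, records the
  # failure and stops, so later events are ignored.
  def on_event
    return false unless @running
    @sampler.should_sample?
  rescue SketchClockError => e
    @failure = e
    @running = false
    false
  end
end

healthy = SketchWorker.new(SketchSampler.new(-> { 1_000 }))
broken = SketchWorker.new(SketchSampler.new(-> { 0 }))
healthy.on_event
broken.on_event
puts "healthy running: #{healthy.running?}, broken running: #{broken.running?}"
```

Once the broken worker has recorded a failure, every later `on_event` call returns early, so it stops consuming resources.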

Motivation:
We can't trust the profiler's results if the clock isn't working, and it's better to shut things down than to keep consuming resources.

Additional Notes:

How to test the change?

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@AlexJF AlexJF requested a review from a team as a code owner February 6, 2024 11:30
@github-actions github-actions bot added the profiling Involves Datadog profiling label Feb 6, 2024
@AlexJF AlexJF force-pushed the alexjf/prof-9035-handle-clock-errors-allocation-sampler branch from b5f23a2 to 6234ae2 Compare February 6, 2024 11:35
Member

@ivoanjo ivoanjo left a comment

👍 LGTM

Base automatically changed from alexjf/prof-9045-alloc-sampling-spikes-within-windows to master February 12, 2024 12:12
@codecov-commenter

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (8212e9f) 98.23% compared to head (4a03871) 98.23%.

Files                                                  Patch %   Lines
...filing/collectors/discrete_dynamic_sampler_spec.rb  95.83%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3439      +/-   ##
==========================================
- Coverage   98.23%   98.23%   -0.01%     
==========================================
  Files        1277     1277              
  Lines       75141    75175      +34     
  Branches     3544     3547       +3     
==========================================
+ Hits        73817    73849      +32     
- Misses       1324     1326       +2     

Member

@ivoanjo ivoanjo left a comment

Left a few more notes.

I must say, after going through this PR it's definitely... a very big change to enable the discrete sampler to say "hey, the clock is broken".

I kinda wonder if we should instead move the responsibility for checking the clock back to the CpuAndWallTimeWorker. It already needs to check for errors when calling the discrete sampler, so it's not like we're "hiding" a lot of complexity from it.

On the other hand I kinda realize my suggestion may be ~slightly cruel after you spent all this time on this PR (and I did a bit more on top as I was reviewing it -- the thought only kinda occurred to me at the end).

Also, the current mechanism is definitely more flexible in terms of allowing many more error states to be introduced, whereas my suggestion only really solves the issue of the clock error.

I leave the decision up to you ;)
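
For comparison, the alternative raised in this review (the worker checks the clock itself, and the sampler simply receives a timestamp) could look roughly like the Ruby sketch below. This is only an illustration under assumptions: `PlainSketchSampler`, `ClockCheckingSketchWorker`, and the `clock` callable are hypothetical names, not identifiers from the project.

```ruby
# Hypothetical sketch of the alternative design; names are made up.

# The sampler no longer reads the clock; it is handed a timestamp.
class PlainSketchSampler
  attr_reader :last_seen_ns

  def should_sample?(now_ns)
    @last_seen_ns = now_ns
    true
  end
end

# The worker owns the clock and validates it before calling the sampler.
class ClockCheckingSketchWorker
  attr_reader :failure

  # clock: a callable returning monotonic nanoseconds, or 0 on error (assumption)
  def initialize(sampler, clock)
    @sampler = sampler
    @clock = clock
    @failure = nil
    @running = true
  end

  def running?
    @running
  end

  def on_event
    return false unless @running
    now_ns = @clock.call
    if now_ns.zero?
      # Only the clock error is handled here; other sampler error
      # states would need their own handling.
      @failure = "clock returned 0"
      @running = false
      return false
    end
    @sampler.should_sample?(now_ns)
  end
end
```

The trade-off matches the one described above: this version keeps the sampler simpler, but it only covers the clock error, whereas an error channel inside the sampler leaves room for more failure states.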

@AlexJF AlexJF changed the title [PROF-9035] Stop worker on allocation sampler failure [PROF-9035] Stop worker on clock failure Mar 4, 2024
Member

@ivoanjo ivoanjo left a comment

👍 LGTM

@AlexJF AlexJF merged commit 52d8d43 into master Mar 5, 2024
219 checks passed
@AlexJF AlexJF deleted the alexjf/prof-9035-handle-clock-errors-allocation-sampler branch March 5, 2024 11:31
@github-actions github-actions bot added this to the 2.0 milestone Mar 5, 2024
@ivoanjo ivoanjo modified the milestones: 2.0, 1.21.0 Mar 14, 2024
Labels
profiling Involves Datadog profiling

3 participants