
Improve clearing of profile data after Ruby app forks #2362

Merged: 6 commits merged into master on Nov 16, 2022

Conversation

@ivoanjo (Member) commented on Nov 10, 2022

What does this PR do?:

When a Ruby app forks, we automatically restart the profiler in the child process as well. This ensures that in apps such as forking webservers we still provide full profiling coverage.

As part of automatically restarting the profiler in the child process, we want to clear any profiling data that has accumulated from the parent, as otherwise both the parent as well as the child process would report that data.

This "clearing of profiling data" happens in Scheduler#after_fork. Previously, this method would "clear data" by asking the exporter for the latest profile and then throwing this away.

This worked, but it was rather heavy-handed -- it did some useless work and generated a few slightly confusing debug logs:

```
DEBUG (dd-trace-rb/lib/datadog/profiling/scheduler.rb:73:in `after_fork') Flushing exporter in child process #after_fork and discarding data
DEBUG (dd-trace-rb/lib/datadog/profiling/stack_recorder.rb:25:in `serialize') Encoded profile covering 2022-11-10T09:42:22Z to 2022-11-10T09:42:22Z
DEBUG (dd-trace-rb/lib/datadog/profiling/exporter.rb:52:in `flush') Skipped exporting profiling events as profile duration is below minimum
```

To improve this for the new Ruby profiler, I've decided to implement a StackRecorder#clear, and then refactor the Scheduler and Exporter to use it instead.

For the old Ruby profiler (still the default and in use by customers), I chose not to implement a `#clear`, so the old approach of serializing a profile and throwing it away is still used.
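As a rough sketch of the refactor described above (the class and method names `Scheduler#after_fork`, `Exporter#clear`, and `StackRecorder#clear` come from this PR; the method bodies and the `exporter`/`pprof_recorder` accessors are illustrative assumptions, not the actual diff):

```ruby
module Datadog
  module Profiling
    class Scheduler
      # After a fork, explicitly discard data inherited from the parent
      # instead of serializing a throwaway profile via Exporter#flush.
      def after_fork
        exporter.clear
      end
    end

    class Exporter
      # Delegates clearing to the recorder that owns the buffered samples.
      def clear
        pprof_recorder.clear
      end
    end
  end
end
```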

Motivation:

I've wanted to improve this for some time, but chose not to because I didn't want to invest in improving the old profiler just to then throw it away. Now that the new profiler is getting into better shape, I finally decided to implement this.

Additional Notes:

This PR is on top of #2359 just for my development convenience; it has no explicit dependencies on it.

How to test the change?:

Validate that profiles from the child process of a Ruby app that forks do not contain any profiling data from the parent process by looking at the profiler UX.
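One way to exercise this path locally is a small script that profiles a forking process (a minimal sketch: profiling is enabled via the standard `Datadog.configure` setting, while the `busy_work` helper, workload size, and sleep duration are illustrative assumptions based on the profiler's roughly one-minute reporting period):

```ruby
require 'ddtrace'

Datadog.configure do |c|
  c.profiling.enabled = true
end

# Hypothetical helper: generate some samples in the parent before forking.
def busy_work
  100_000.times { |i| Math.sqrt(i) }
end

busy_work

child = fork do
  # The profiler restarts automatically in this child process; the first
  # profile it reports should contain no samples from the parent's busy_work.
  busy_work
  sleep 70 # leave time for at least one profile to be reported
end

Process.wait(child)
```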

Calling `exporter.flush` after fork to clear any existing profiling
data works, but it's rather heavy-handed and can generate a few
confusing log messages:

```
DEBUG (dd-trace-rb/lib/datadog/profiling/scheduler.rb:73:in `after_fork') Flushing exporter in child process #after_fork and discarding data
DEBUG (dd-trace-rb/lib/datadog/profiling/stack_recorder.rb:25:in `serialize') Encoded profile covering 2022-11-10T09:42:22Z to 2022-11-10T09:42:22Z
DEBUG (dd-trace-rb/lib/datadog/profiling/exporter.rb:52:in `flush') Skipped exporting profiling events as profile duration is below minimum
```

To improve this, I've created a new `Exporter#clear` that is still
(currently) implemented using `flush`, but now the exporter is told
explicitly what's being asked of it, so it can later implement `clear`
in a better way.
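At this intermediate step the idea is roughly the following (a sketch, not the actual diff; the method body is an assumption):

```ruby
class Exporter
  # Discard any accumulated profiling data. For now this piggybacks on
  # #flush and drops the result; a later commit replaces it with a real
  # clear on the recorder.
  def clear
    flush
    nil
  end
end
```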
…eeded

The previous `flush` call was a placeholder, and we can now improve it.
This will be used by the `Exporter` when needing to clear data without
serializing it.
As the `StackRecorder` implements `#clear` natively, no need to
simulate it with a call to `#serialize`.

I decided not to bother updating the `OldRecorder`, so for that
class we'll still use the old behavior.
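Conceptually the exporter ends up with two paths, depending on which recorder backs it (a sketch only; the `pprof_recorder` accessor and the `respond_to?` dispatch are assumptions for illustration, and the actual code may wire the fallback differently):

```ruby
class Exporter
  def clear
    if pprof_recorder.respond_to?(:clear)
      # New profiler: StackRecorder#clear drops buffered samples natively.
      pprof_recorder.clear
    else
      # Old profiler (OldRecorder): keep the serialize-and-discard behavior.
      pprof_recorder.serialize
    end
    nil
  end
end
```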
@ivoanjo requested a review from a team on November 10, 2022 11:00
@github-actions bot added the `profiling` label (Involves Datadog profiling) on Nov 10, 2022
@codecov-commenter

Codecov Report

Merging #2362 (d369843) into ivoanjo/prof-5943-handle-vm-forking (eb8fcb6) will increase coverage by 0.00%.
The diff coverage is 100.00%.

```
@@                         Coverage Diff                          @@
##           ivoanjo/prof-5943-handle-vm-forking    #2362   +/-   ##
====================================================================
  Coverage                                98.31%   98.32%
====================================================================
  Files                                     1102     1102
  Lines                                    58994    59070   +76
====================================================================
+ Hits                                     58000    58080   +80
+ Misses                                     994      990    -4
```
| Impacted Files | Coverage Δ |
|---|---|
| lib/datadog/profiling/exporter.rb | 100.00% <100.00%> (ø) |
| lib/datadog/profiling/scheduler.rb | 100.00% <100.00%> (ø) |
| lib/datadog/profiling/stack_recorder.rb | 100.00% <100.00%> (ø) |
| spec/datadog/profiling/exporter_spec.rb | 100.00% <100.00%> (ø) |
| spec/datadog/profiling/scheduler_spec.rb | 100.00% <100.00%> (ø) |
| spec/datadog/profiling/stack_recorder_spec.rb | 100.00% <100.00%> (ø) |
| lib/datadog/core/diagnostics/environment_logger.rb | 98.42% <0.00%> (-1.58%) ⬇️ |
| ...dog/profiling/collectors/cpu_and_wall_time_spec.rb | 97.05% <0.00%> (-0.50%) ⬇️ |
| ...adog/tracing/contrib/sidekiq/client_tracer_spec.rb | 100.00% <0.00%> (+2.08%) ⬆️ |
| ...ontrib/sidekiq/server_internal_tracer/heartbeat.rb | 64.00% <0.00%> (+28.00%) ⬆️ |


Base automatically changed from ivoanjo/prof-5943-handle-vm-forking to master November 10, 2022 14:20
ivoanjo added a commit that referenced this pull request Nov 15, 2022
In the previous commit, we refactored the old profiler codepath so that
the `Profiler` would call `#reset_after_fork` on collectors before
restarting them in a forking process.

In this commit, we use the added hook to propagate the
`#reset_after_fork` call to every component, so that they can clean up
their internal state in the child process of a fork.

In particular:
* The chain starts on the `Collectors::CpuAndWallTimeWorker` which must
  start by disabling its active `TracePoint`s so that something like
  GC doesn't trigger a sample while we're still resetting the profiler.

* Then the `Collectors::CpuAndWallTime` resets:
  a. Its per-thread tracking information. The native thread id and
     CPU-time tracking of the thread that survived the fork need to
     be reset, and entries for all other threads, which did not
     survive the fork, need to be cleared.
  b. Its internal stats, which need to be reset as well.

* Then the `StackRecorder` resets:
  a. The active slot and its concurrency control. This is to avoid any
     issues if the fork happened while a serialization attempt was
     ongoing and left the concurrency control in an inconsistent
     state.
  b. The profiles, so there's no left over data from the parent
     process in the child process profiles.

The `#reset_after_fork` approach means the `StackRecorder#clear`
method recently added in #2362 is no longer needed, so I went ahead
and removed it.
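The shape of that `#reset_after_fork` chain, as described in the commit message, is roughly the following (the component names come from the message; the `disable_tracepoints` and `cpu_and_wall_time_collector` helpers and all bodies are placeholders, and in the real profiler parts of this live in the native extension):

```ruby
module Datadog
  module Profiling
    module Collectors
      class CpuAndWallTimeWorker
        def reset_after_fork
          # 1. Disable the active TracePoints first, so e.g. a GC event can't
          #    trigger a sample while the rest of the state is being reset.
          disable_tracepoints # hypothetical helper
          cpu_and_wall_time_collector.reset_after_fork
        end
      end

      class CpuAndWallTime
        def reset_after_fork
          # 2. Reset per-thread tracking (native thread id and CPU-time state
          #    for the surviving thread, entries for threads that didn't
          #    survive) and the collector's internal stats.
        end
      end
    end

    class StackRecorder
      def reset_after_fork
        # 3. Reset the active slot / concurrency control (in case the fork
        #    happened mid-serialization) and drop the parent's profile data.
      end
    end
  end
end
```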
@ivoanjo merged commit f93d1b3 into master on Nov 16, 2022
@ivoanjo deleted the ivoanjo/better-profile-clearing branch on November 16, 2022 08:41
@github-actions bot added this to the 1.7.0 milestone on Nov 16, 2022
Labels: `profiling` (Involves Datadog profiling)

3 participants