
Improve clearing of profile data after Ruby app forks #2362

Merged: 6 commits merged into master on Nov 16, 2022

Conversation

@ivoanjo (Member) commented on Nov 10, 2022

What does this PR do?:

When a Ruby app forks, we automatically restart the profiler in the child process as well. This ensures that in apps such as forking webservers we still provide full profiling coverage.

As part of automatically restarting the profiler in the child process, we want to clear any profiling data that has accumulated from the parent, as otherwise both the parent as well as the child process would report that data.

This "clearing of profiling data" happens in Scheduler#after_fork. Previously, this method would "clear data" by asking the exporter for the latest profile and then throwing this away.

This worked, but it was rather heavy-handed -- it did some useless work and generated a few slightly confusing debug logs:

```
DEBUG (dd-trace-rb/lib/datadog/profiling/scheduler.rb:73:in `after_fork') Flushing exporter in child process #after_fork and discarding data
DEBUG (dd-trace-rb/lib/datadog/profiling/stack_recorder.rb:25:in `serialize') Encoded profile covering 2022-11-10T09:42:22Z to 2022-11-10T09:42:22Z
DEBUG (dd-trace-rb/lib/datadog/profiling/exporter.rb:52:in `flush') Skipped exporting profiling events as profile duration is below minimum
```

To improve this for the new Ruby profiler, I've decided to implement a StackRecorder#clear, and then refactor the Scheduler and Exporter to use it instead.

For the old Ruby profiler (still the default and in use by customers), I chose not to implement a `#clear`, so the old approach of serializing a profile and throwing it away is still used.
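As a rough sketch of the refactor described above (the class and method names `Scheduler#after_fork`, `Exporter#clear`, and `StackRecorder#clear` come from this PR; the method bodies and the `exporter`/`pprof_recorder` accessors are illustrative assumptions, not the actual diff):

```ruby
module Datadog
  module Profiling
    class Scheduler
      # After a fork, explicitly discard data inherited from the parent
      # instead of serializing a throwaway profile via Exporter#flush.
      def after_fork
        exporter.clear
      end
    end

    class Exporter
      # Delegates clearing to the recorder that owns the buffered samples.
      def clear
        pprof_recorder.clear
      end
    end
  end
end
```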

Motivation:

I've wanted to improve this for some time, but chose not to because I didn't want to invest in improving the old profiler just to then throw it away. Now that the new profiler is getting into better shape, I finally decided to implement this.

Additional Notes:

This PR is on top of #2359 just for my development convenience; it has no explicit dependencies on it.

How to test the change?:

Validate that profiles from the child process of a Ruby app that forks do not contain any profiling data from the parent process by looking at the profiler UX.
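One way to exercise this path locally is a small script that profiles a forking process (a minimal sketch: profiling is enabled via the standard `Datadog.configure` setting, while the `busy_work` helper, workload size, and sleep duration are illustrative assumptions based on the profiler's roughly one-minute reporting period):

```ruby
require 'ddtrace'

Datadog.configure do |c|
  c.profiling.enabled = true
end

# Hypothetical helper: generate some samples in the parent before forking.
def busy_work
  100_000.times { |i| Math.sqrt(i) }
end

busy_work

child = fork do
  # The profiler restarts automatically in this child process; the first
  # profile it reports should contain no samples from the parent's busy_work.
  busy_work
  sleep 70 # leave time for at least one profile to be reported
end

Process.wait(child)
```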

Calling `exporter.flush` after fork to clear any existing profiling
data works, but it's rather heavy-handed and can generate a few
confusing log messages:

```
DEBUG (dd-trace-rb/lib/datadog/profiling/scheduler.rb:73:in `after_fork') Flushing exporter in child process #after_fork and discarding data
DEBUG (dd-trace-rb/lib/datadog/profiling/stack_recorder.rb:25:in `serialize') Encoded profile covering 2022-11-10T09:42:22Z to 2022-11-10T09:42:22Z
DEBUG (dd-trace-rb/lib/datadog/profiling/exporter.rb:52:in `flush') Skipped exporting profiling events as profile duration is below minimum
```

To improve this, I've created a new `Exporter#clear` that is still
(currently) implemented using `flush`, but now the exporter is told
explicitly what's being asked of it, so it can later implement `clear`
in a better way.
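At this intermediate step the idea is roughly the following (a sketch, not the actual diff; the method body is an assumption):

```ruby
class Exporter
  # Discard any accumulated profiling data. For now this piggybacks on
  # #flush and drops the result; a later commit replaces it with a real
  # clear on the recorder.
  def clear
    flush
    nil
  end
end
```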
…eeded

The previous `flush` call was a placeholder, and we can now improve it.
This will be used by the `Exporter` when needing to clear data without
serializing it.
As the `StackRecorder` implements `#clear` natively, no need to
simulate it with a call to `#serialize`.

I decided not to bother updating the `OldRecorder`, so for that
class we'll still use the old behavior.
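Conceptually the exporter ends up with two paths, depending on which recorder backs it (a sketch only; the `pprof_recorder` accessor and the `respond_to?` dispatch are assumptions for illustration, and the actual code may wire the fallback differently):

```ruby
class Exporter
  def clear
    if pprof_recorder.respond_to?(:clear)
      # New profiler: StackRecorder#clear drops buffered samples natively.
      pprof_recorder.clear
    else
      # Old profiler (OldRecorder): keep the serialize-and-discard behavior.
      pprof_recorder.serialize
    end
    nil
  end
end
```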
@ivoanjo requested a review from a team on November 10, 2022 11:00
@github-actions bot added the `profiling` label (Involves Datadog profiling) on Nov 10, 2022
@codecov-commenter

Codecov Report

Merging #2362 (d369843) into ivoanjo/prof-5943-handle-vm-forking (eb8fcb6) will increase coverage by 0.00%.
The diff coverage is 100.00%.

```
@@                         Coverage Diff                          @@
##           ivoanjo/prof-5943-handle-vm-forking    #2362   +/-   ##
====================================================================
  Coverage                                98.31%   98.32%
====================================================================
  Files                                     1102     1102
  Lines                                    58994    59070   +76
====================================================================
+ Hits                                     58000    58080   +80
+ Misses                                     994      990    -4
```
| Impacted Files | Coverage Δ |
|---|---|
| lib/datadog/profiling/exporter.rb | 100.00% <100.00%> (ø) |
| lib/datadog/profiling/scheduler.rb | 100.00% <100.00%> (ø) |
| lib/datadog/profiling/stack_recorder.rb | 100.00% <100.00%> (ø) |
| spec/datadog/profiling/exporter_spec.rb | 100.00% <100.00%> (ø) |
| spec/datadog/profiling/scheduler_spec.rb | 100.00% <100.00%> (ø) |
| spec/datadog/profiling/stack_recorder_spec.rb | 100.00% <100.00%> (ø) |
| lib/datadog/core/diagnostics/environment_logger.rb | 98.42% <0.00%> (-1.58%) ⬇️ |
| ...dog/profiling/collectors/cpu_and_wall_time_spec.rb | 97.05% <0.00%> (-0.50%) ⬇️ |
| ...adog/tracing/contrib/sidekiq/client_tracer_spec.rb | 100.00% <0.00%> (+2.08%) ⬆️ |
| ...ontrib/sidekiq/server_internal_tracer/heartbeat.rb | 64.00% <0.00%> (+28.00%) ⬆️ |


Base automatically changed from ivoanjo/prof-5943-handle-vm-forking to master November 10, 2022 14:20
ivoanjo added a commit that referenced this pull request Nov 15, 2022
In the previous commit, we refactored the old profiler codepath so that
the `Profiler` would call `#reset_after_fork` on collectors before
restarting them in a forking process.

In this commit, we use the added hook to propagate the
`#reset_after_fork` call to every component, so that they can clean up
their internal state in the child process of a fork.

In particular:
* The chain starts on the `Collectors::CpuAndWallTimeWorker` which must
  start by disabling its active `TracePoint`s so that something like
  GC doesn't trigger a sample while we're still resetting the profiler.

* Then the `Collectors::CpuAndWallTime` resets:
  a. Its per-thread tracking information. The native thread id and
     CPU-time tracking of the thread that survived the fork need to
     be reset, and entries for all other threads, which did not
     survive the fork, need to be cleared.
  b. Its internal stats, which need to be reset as well.

* Then the `StackRecorder` resets:
  a. The active slot and its concurrency control. This is to avoid any
     issues if the fork happened while a serialization attempt was
     ongoing and left the concurrency control in an inconsistent
     state.
  b. The profiles, so there's no left over data from the parent
     process in the child process profiles.

The `#reset_after_fork` approach means the `StackRecorder#clear`
method recently added in #2362 is no longer needed, so I went ahead
and removed it.
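The shape of that `#reset_after_fork` chain, as described in the commit message, is roughly the following (the component names come from the message; the `disable_tracepoints` and `cpu_and_wall_time_collector` helpers and all bodies are placeholders, and in the real profiler parts of this live in the native extension):

```ruby
module Datadog
  module Profiling
    module Collectors
      class CpuAndWallTimeWorker
        def reset_after_fork
          # 1. Disable the active TracePoints first, so e.g. a GC event can't
          #    trigger a sample while the rest of the state is being reset.
          disable_tracepoints # hypothetical helper
          cpu_and_wall_time_collector.reset_after_fork
        end
      end

      class CpuAndWallTime
        def reset_after_fork
          # 2. Reset per-thread tracking (native thread id and CPU-time state
          #    for the surviving thread, entries for threads that didn't
          #    survive) and the collector's internal stats.
        end
      end
    end

    class StackRecorder
      def reset_after_fork
        # 3. Reset the active slot / concurrency control (in case the fork
        #    happened mid-serialization) and drop the parent's profile data.
      end
    end
  end
end
```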
@ivoanjo merged commit f93d1b3 into master on Nov 16, 2022
@ivoanjo deleted the ivoanjo/better-profile-clearing branch on November 16, 2022 08:41
@github-actions bot added this to the 1.7.0 milestone on Nov 16, 2022
Labels: `profiling` (Involves Datadog profiling)

3 participants