
[NO-TICKET] Redesign GC profiling to add timeline support and reduce overhead #3313

Merged
merged 21 commits into master on Dec 14, 2023

Conversation

@ivoanjo (Member) commented on Dec 8, 2023

What does this PR do?

This PR redesigns the garbage collection profiling feature to add support for timeline and reduce its overhead.

Here's how it looks in timeline mode:

(screenshot: GC samples in timeline mode)

Here's how it looks in flamegraph mode:

(screenshot: GC samples in flamegraph mode)

A TL;DR description of how it works now is below:

  1. Wiring it up
  • Ruby exposes native-level tracepoint events for when it starts/stops doing GC (`RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT`).
  • Inside the `CpuAndWallTimeWorker` we create a tracepoint that listens for these events and calls `on_gc_start` and `on_gc_finish` on the `ThreadContext` collector.
  • The `RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT` tracepoint events are emitted while Ruby is doing GC, which means we can't allocate anything while handling them. The handling code is thus built to not allocate at all.
  • We can't do all the work we want in `on_gc_finish` because we can't allocate anything, so that function returns a flag that tells the `CpuAndWallTimeWorker` to call `sample_after_gc` as soon as the GC finishes.
  2. Tracking
  • When a thread starts doing GC (`on_gc_start`), we record its current cpu-time and wall-time.
  • When a thread stops doing GC (`on_gc_finish`), we record how much cpu-time and wall-time elapsed.
  • When `sample_after_gc` gets called, we turn the data recorded above, plus some metadata from Ruby, into a sample event that represents GC as a fake Garbage Collection thread.
  3. Overhead
  • To reduce its impact on applications, the Ruby GC works incrementally: rather than doing a full GC (marking, sweeping) in a single step, Ruby breaks the work into smaller steps.
  • We get told about every smaller step: each one produces a pair of `on_gc_start`/`on_gc_finish` calls.
  • In the past, we called `sample_after_gc` after every `on_gc_finish` to record the GC. This was a big source of overhead: in a sample application that allocates as much as possible in a loop, I observed more than 50k calls to `on_gc_start`/`on_gc_finish` per second, which would add up to a lot of overhead if we created one GC sample for each of them.
  • To avoid this overhead, we coalesce/accumulate the cpu/wall-time from multiple `on_gc_start`/`on_gc_finish` calls and only create a sample to represent this data from time to time (by default every 10 ms or every full GC cycle, whichever happens first); see the sketch below.
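
To make the coalescing concrete, here is a minimal Ruby sketch of the idea. The actual implementation is native (C) code in the `ThreadContext` collector; the class, constant, and helper names below (`GcTrackingSketch`, `FLUSH_INTERVAL_NS`, `record_gc_sample`) are made up for illustration, and only wall-time is shown (the real code also accumulates cpu-time).

```ruby
# Illustrative sketch only -- not the actual dd-trace-rb code.
class GcTrackingSketch
  FLUSH_INTERVAL_NS = 10_000_000 # the 10ms default flush interval mentioned above

  def initialize
    @accumulated_wall_time_ns = 0
    @last_flush_ns = now_ns
  end

  # Called on RUBY_INTERNAL_EVENT_GC_ENTER: remember when this GC step started.
  # Must not allocate, since Ruby is in the middle of a GC.
  def on_gc_start
    @gc_step_started_ns = now_ns
  end

  # Called on RUBY_INTERNAL_EVENT_GC_EXIT: accumulate this step's duration and
  # return a flag telling the worker whether to call sample_after_gc once GC ends.
  def on_gc_finish(full_gc_cycle_finished: false)
    @accumulated_wall_time_ns += now_ns - @gc_step_started_ns

    full_gc_cycle_finished || (now_ns - @last_flush_ns) >= FLUSH_INTERVAL_NS
  end

  # Called outside of GC, where allocating is safe again: flush everything
  # accumulated so far as one sample on the placeholder "Garbage Collection" thread.
  def sample_after_gc
    record_gc_sample(@accumulated_wall_time_ns)
    @accumulated_wall_time_ns = 0
    @last_flush_ns = now_ns
  end

  private

  def now_ns
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  end

  def record_gc_sample(wall_time_ns)
    # Placeholder: the real code pushes the coalesced sample into the profile.
  end
end
```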

Compared to the old implementation, here's what changed:

  • Garbage collection events get attributed to a placeholder "Garbage Collection" thread
  • We no longer record "Garbage Collection" samples on the thread that triggered GC (and thus no longer include that thread's stack when recording GC information). This enabled a nice cleanup of the code in the stack collector.
  • GCs across several threads can be included in the same sample
  • Timeline support is included
  • Extra metadata about the GC can be seen in the timeline
  • Overhead is lowered, even during heavy periods of GC action

Motivation:

I worked on this as part of the Profiler R&D week. I've been wanting to go back and fix the problems with the GC profiling implementation for a long time :)

Additional Notes:

I'm opening this PR on top of @AlexJF's #3281 because otherwise both were going to conflict, given the changes to the stack recorder. The two PRs are otherwise completely independent.

I'll wait until #3281 is merged to master to merge this PR as well.

How to test the change?

Test coverage is included.

During development, this tiny benchmark was useful:

```ruby
require 'benchmark/ips'
require 'ddtrace'

puts RUBY_DESCRIPTION

Datadog.configure do |c|
  c.profiling.enabled = true
  c.profiling.advanced.allocation_counting_enabled = false
  c.telemetry.enabled = false
end

class Potato
  def initialize(val)
    @potato = val
  end
end

Benchmark.ips do |x|
  x.config(time: 10, warmup: 2)

  x.report("DD_PROFILING_FORCE_ENABLE_GC=#{ENV['DD_PROFILING_FORCE_ENABLE_GC']} Potato.new(10)", "Potato.new(10)")

  x.compare!
end
```

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@ivoanjo ivoanjo requested review from a team as code owners December 8, 2023 16:46
@github-actions github-actions bot added the profiling Involves Datadog profiling label Dec 8, 2023
Comment on lines +647 to +654
.wall_time_ns = state->gc_tracking.accumulated_wall_time_ns,
.timeline_wall_time_ns = state->gc_tracking.accumulated_wall_time_ns,
@AlexJF (Contributor) commented on Dec 11, 2023

I don't get the need for both? Is this a backend limitation (i.e., why doesn't it just parse wall_time_ns values directly)? Why does it (seem to) work fine for non-GC frames?

@ivoanjo (Member, Author) replied:

This is a great question. I've pushed a note in 391db52 to address this:

      // This event gets both a regular cpu/wall-time duration, as a normal cpu/wall-time sample would, as well as a
      // timeline duration.
      // This is done to enable two use-cases:
      // * regular cpu/wall-time makes this event show up as a regular stack in the flamegraph
      // * the timeline duration is used when the event shows up in the timeline

In practice, we could change the backend to get the cpu/wall-time duration and use it for the timeline event, but doing it this way keeps us consistent with what other runtimes do for timeline (e.g. dotnet), and thus makes it easier to share code across them.
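
For illustration only (this is not the actual serialization code): the coalesced GC sample carries the same accumulated wall-time under both fields quoted above, so the event renders both as a flamegraph stack and as a timeline event.

```ruby
# Illustrative sketch; field names follow the C snippet quoted in this thread.
accumulated_wall_time_ns = 1_500_000 # example: 1.5ms of GC work since the last flush

gc_sample_durations = {
  wall_time_ns:          accumulated_wall_time_ns, # used when rendering the flamegraph
  timeline_wall_time_ns: accumulated_wall_time_ns, # used when rendering the timeline
}
```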

@codecov-commenter

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (56745c9) 98.22% compared to head (807a519) 98.23%.
Report is 34 commits behind head on alexjf/prof-8667-heap-profiling-part1.

Files Patch % Lines
...atadog/profiling/collectors/thread_context_spec.rb 99.16% 1 Missing ⚠️
Additional details and impacted files
@@                          Coverage Diff                           @@
##           alexjf/prof-8667-heap-profiling-part1    #3313   +/-   ##
======================================================================
  Coverage                                  98.22%   98.23%           
======================================================================
  Files                                       1253     1253           
  Lines                                      72794    72774   -20     
  Branches                                    3419     3414    -5     
======================================================================
- Hits                                       71504    71486   -18     
+ Misses                                      1290     1288    -2     


@AlexJF (Contributor) left a comment

Still LGTM, 👍 Heads-up on incoming conflicts 😄

@marcotc (Member) left a comment

Awesome work, @ivoanjo!

Base automatically changed from alexjf/prof-8667-heap-profiling-part1 to master December 12, 2023 12:28
When this feature was added, we weren't yet working on allocation
profiling, so having stacks would hint at where the allocation that
triggered the GC happened.

This... is basically a very crappy allocation profiler, and as
an anti-bonus it can also get quite expensive.

As part of shaping this feature up to have it on by default, let's
no longer collect stacks. As an alternative, you can get similar
context either from allocation profiling OR by looking at the thread
timeline to see what the thread was doing around the time the GC
happened.
These abstractions added some complexity to the stack collector, and
were only needed to allow us to put extra placeholder frames on top
of a true collected stack.

Since this was only being used by the GC placeholder feature, and
we've changed that to not collect a stack anymore, we can clean up
this mechanism and simplify the code.
This function is rather trivial, BUT it has a really nice big comment
explaining it, so it's useful to extract it and keep the comment
separate, instead of moving the big comment into the already
very complex `sample_thread_internal`.
While still accurate, I don't think this comment is relevant anymore --
we no longer have the skid issue, and the rest is just historical
context that I don't think is interesting enough to keep around.
This event type will be used to report GC information for use in the
profiling timeline.
This helper takes care of mapping `rb_gc_latest_gc_info` into metadata
that gets added to GC samples.

This helper will get tested via the thread context specs (changes
upcoming).
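
As a rough Ruby-level illustration of such a mapping (the real helper is C code calling `rb_gc_latest_gc_info`; the method and label names below are made up for the example, not necessarily what the profiler emits):

```ruby
# Illustrative only: approximates the native helper using the public Ruby API.
def example_gc_metadata
  info = GC.latest_gc_info # Ruby-level counterpart of rb_gc_latest_gc_info

  {
    'gc cause' => info[:gc_by].to_s,                   # e.g. "newobj", "method", "malloc"
    'gc type'  => info[:major_by] ? 'major' : 'minor', # :major_by is only set for major GCs
  }
end
```
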
…overhead

Instead, we expose the `record_placeholder_stack` directly.
No longer needed!
…cer-2.7 docker image ships

This allows us to keep using `-Werror` in CI without issue, which is
pretty valuable.
Because GC samples can be coalesced, we can't assert on 5 GCs resulting
in exactly 5 GC events.
@ivoanjo ivoanjo force-pushed the ivoanjo/refactor-gc-tracking-timeline-try3 branch from 807a519 to bfdd1b7 on December 13, 2023 14:30
@ivoanjo (Member, Author) commented on Dec 13, 2023

Update: This PR was originally opened on top of #3281 but there were a number of changes to that branch, so I chose to rebase on top of master and force-push.

The rebase was clean other than a manual fix in profiling/component.rb + a new spec from master that needed updating (bfdd1b7). No other changes were made.

It seems that even with `GC.start()` we can observe different reasons,
so I've relaxed the spec.
Because GC samples are pushed with `thread id: GC`, this was breaking
`object_id_from` in tests that happened to have some GC samples.

There is no `object_id` for the "GC thread" because there is no such
thread -- it's just a way of representing GC information.
@ivoanjo (Member, Author) commented on Dec 14, 2023

Thanks y'all for the help and reviews. Merging away!

@ivoanjo ivoanjo merged commit 94d7371 into master Dec 14, 2023
218 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/refactor-gc-tracking-timeline-try3 branch December 14, 2023 08:51
@github-actions github-actions bot added this to the 1.19.0 milestone Dec 14, 2023
@ivoanjo ivoanjo restored the ivoanjo/refactor-gc-tracking-timeline-try3 branch December 14, 2023 11:49
@ivoanjo ivoanjo deleted the ivoanjo/refactor-gc-tracking-timeline-try3 branch January 9, 2024 17:05