[NO-TICKET] Redesign GC profiling to add timeline support and reduce overhead #3313
Commits on Dec 13, 2023
Commit f3da457
Garbage Collection samples no longer collect stacks
When this feature was added, we weren't yet working on allocation profiling, so having stacks would hint at where the allocation that triggered the GC happened. This... is basically a very crappy allocation profiler, and as an anti-bonus it can also get quite expensive. As part of shaping this feature up to be on by default, let's no longer collect stacks. As an alternative, you can get similar context either from allocation profiling OR by looking at the thread timeline for what the thread was doing around the time the GC happened.
Commit 3bca251
Remove record_buffer/extra_frames_in_record_buffer abstraction
These abstractions added some complexity to the stack collector, and were only needed to allow us to put extra placeholder frames on top of a true collected stack. Since this was only being used by the GC placeholder feature, and we've changed that to not collect a stack anymore, we can clean up this mechanism and simplify the code.
Commit 1f5926b
Extract `record_placeholder_stack_in_native_code` again
This function is rather trivial, BUT we have a really nice big comment to explain it, so it's useful to extract it so the comment can live separately, instead of moving it inside the already very complex `sample_thread_internal`.
Commit 6f922aa
While still accurate, I don't think this comment is still relevant -- we no longer have the skid issue, and the rest is just historical context that I don't think is that interesting to keep around.
Commit 7ce8d90
Add timeline event type to stack recorder
This event type will be used to report GC information for use in the profiling timeline.
Commit c376568
This helper takes care of mapping `rb_gc_latest_gc_info` into metadata that gets added to GC samples. This helper will get tested via the thread context specs (changes upcoming).
Commit 14d9789
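As a rough illustration of the kind of mapping the helper performs, here's a hypothetical Ruby-level sketch built on the public `GC.latest_gc_info` API (the Ruby counterpart of the native `rb_gc_latest_gc_info`). The method name `gc_sample_labels` and the exact label names are invented for this example; the real helper lives in native code:

```ruby
# Hypothetical sketch: turn the hash from GC.latest_gc_info into string
# labels that could be attached to a GC sample. The hash keys used here
# (:major_by, :gc_by) match Ruby's documented output; the label names are
# made up for illustration.
def gc_sample_labels
  info = GC.latest_gc_info

  labels = {}
  # :major_by is non-nil only when the last GC was a major GC
  labels['gc type']   = info[:major_by] ? 'major' : 'minor'
  # :gc_by records what triggered the GC (e.g. :newobj, :malloc, :method)
  labels['gc cause']  = info[:gc_by].to_s
  labels['gc reason'] = (info[:major_by] || :none).to_s
  labels
end

GC.start
p gc_sample_labels # prints a small hash, e.g. {"gc type"=>"major", ...}
```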
Commit c64d78a
[NO-TICKET] Redesign GC profiling to add timeline support and reduce overhead

**What does this PR do?**

This PR redesigns the garbage collection profiling feature to add support for timeline and reduce its overhead. A TL;DR description of how it works now is below:

1. Wiring it up
   * Ruby exposes native-level tracepoint events for when it starts/stops doing GC (`RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT`).
   * Inside the `CpuAndWallTimeWorker` we create a tracepoint that listens for these events and calls the `on_gc_start` and `on_gc_finish` events on the `ThreadContext` collector.
   * The `RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT` tracepoint events are emitted while Ruby is doing GC, which means we can't allocate anything. The code we use is thus built to not allocate anything.
   * We can't do all the work we want in `on_gc_finish` because we can't allocate anything, so that function returns a flag that tells the `CpuAndWallTimeWorker` to call `sample_after_gc` as soon as the GC finishes.
2. Tracking
   * When a thread starts doing GC (`on_gc_start`), we record its current cpu-time and wall-time.
   * When a thread stops doing GC (`on_gc_finish`), we record how much cpu-time and wall-time elapsed.
   * When `sample_after_gc` gets called, we turn the data recorded above plus some metadata from Ruby into a sample event that represents GC as a fake "Garbage Collection" thread.
3. Overhead
   * To avoid heavily impacting applications, the Ruby GC works incrementally: rather than doing a full GC (marking, sweeping) in a single step, Ruby breaks this into smaller steps.
   * We get told about every smaller step: we get a pair of `on_gc_start`/`on_gc_finish` calls for each one.
   * In the past, we always called `sample_after_gc` after every `on_gc_finish` to record the GC. This was a big source of overhead: in a sample application that allocates as much as possible in a loop, I was able to observe more than 50k calls to `on_gc_start`/`on_gc_finish` per second, which obviously would represent a lot of overhead if we created one GC sample for each of them.
   * To avoid this overhead, we coalesce/accumulate the cpu/wall-time from multiple calls to `on_gc_start`/`on_gc_finish` and only create a sample to represent this data from time to time (by default, every 10ms or every full GC cycle, whichever happens first).

Compared to the old implementation, here's what changed:

* Garbage collection events get attributed to a placeholder "Garbage Collection" thread.
* We no longer record "Garbage Collection" samples on the thread that triggered GC (and thus no longer include the stack of that thread when recording GC information). This enabled a nice cleanup of the code in the stack collector.
* GCs across several threads can be included in the same sample.
* Timeline support is included.
* Extra metadata about the GC can be seen in the timeline.
* Overhead is lowered, even during heavy periods of GC action.

**Motivation:**

I worked on this as part of the Profiler R&D week. I've been wanting to go back and fix the problems with the GC profiling implementation for a long time :)

**Additional Notes:**

N/A

**How to test the change?**

Test coverage is included. During development, this tiny benchmark was useful:

```ruby
require 'benchmark/ips'
require 'ddtrace'

puts RUBY_DESCRIPTION

Datadog.configure do |c|
  c.profiling.enabled = true
  c.profiling.advanced.allocation_counting_enabled = false
  c.telemetry.enabled = false
end

class Potato
  def initialize(val)
    @potato = val
  end
end

Benchmark.ips do |x|
  x.config(time: 10, warmup: 2)
  x.report("DD_PROFILING_FORCE_ENABLE_GC=#{ENV['DD_PROFILING_FORCE_ENABLE_GC']} Potato.new(10)", "Potato.new(10)")
  x.compare!
end
```
Commit 0ac923a
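The accumulate-then-flush behavior described under "Overhead" can be sketched in plain Ruby. This is a simplified, hypothetical model only: the real implementation is native code that must not allocate during GC, also tracks cpu-time, and reads the full-cycle state from the VM itself. The `GcCoalescer` class and its method signatures are invented for illustration:

```ruby
# Hypothetical model of GC sample coalescing: durations from many
# on_gc_start/on_gc_finish step pairs are accumulated, and a sample is
# only flushed every FLUSH_INTERVAL_NS or when a full GC cycle completes,
# whichever happens first.
class GcCoalescer
  FLUSH_INTERVAL_NS = 10_000_000 # 10ms, the default mentioned above

  attr_reader :samples

  def initialize
    @accumulated_wall_ns = 0
    @gc_started_at = nil
    @last_flush_at = now_ns
    @samples = []
  end

  def on_gc_start
    @gc_started_at = now_ns
  end

  # Returns true when the caller should invoke sample_after_gc (mirroring
  # how the real on_gc_finish signals the CpuAndWallTimeWorker via a flag,
  # since it cannot allocate a sample itself while GC is running).
  def on_gc_finish(full_cycle_finished: false)
    @accumulated_wall_ns += now_ns - @gc_started_at
    @gc_started_at = nil
    full_cycle_finished || (now_ns - @last_flush_at) >= FLUSH_INTERVAL_NS
  end

  def sample_after_gc
    @samples << { wall_time_ns: @accumulated_wall_ns, thread: 'Garbage Collection' }
    @accumulated_wall_ns = 0
    @last_flush_at = now_ns
  end

  private

  def now_ns
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  end
end

coalescer = GcCoalescer.new
5.times do |i|
  coalescer.on_gc_start
  coalescer.sample_after_gc if coalescer.on_gc_finish(full_cycle_finished: i == 4)
end
# Many GC steps within the 10ms window coalesce into a single sample
```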
Remove `SAMPLE_IN_GC` enum entry
Instead, we expose `record_placeholder_stack` directly.
Commit a2c8913
Commit ab39dce
Commit f3a5c97
Tweak code to avoid triggering warnings on the buggy gcc that our tracer-2.7 docker image ships
This allows us to keep using `-Werror` in CI without issue, which is pretty valuable.
Commit b78d772
Because GC samples can be coalesced, we can't assert on 5 GCs resulting in exactly 5 GC events.
Commit d97f4dc
Commit 0127e3a
Commit e8d7b4e
Add explanation for why we add both cpu/wall-time duration as well as timeline duration
Commit c6b791d
Commit 2265fc3
Commit bfdd1b7
It seems that even with `GC.start()` we can observe different reasons, so I've relaxed the spec.
Commit 5f42107
Because GC samples are pushed with `thread id: GC`, this was breaking `object_id_from` in tests that happened to have some GC samples. There is no `object_id` for the "GC thread" because there is no such thread -- it's just a way of representing GC information.
Commit a3bdd74
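The shape of that fix can be sketched as follows. This is a hypothetical stand-in for the real `object_id_from` test helper, with an invented `thread_for` name and sample layout: the point is only that a helper mapping the `thread id` label back to a `Thread` must now skip GC samples, whose label is the literal string `GC` rather than a numeric object id.

```ruby
# Hypothetical sketch: a test helper that resolves a sample's "thread id"
# label back to a Thread. GC samples carry the literal label "GC"; the
# "GC thread" is only a way of representing GC information, so no real
# Thread (and no object id) exists behind it.
def thread_for(sample)
  thread_id = sample.fetch(:labels).fetch('thread id')
  return nil if thread_id == 'GC' # placeholder GC "thread": nothing to look up

  ObjectSpace._id2ref(Integer(thread_id))
end

gc_sample = { labels: { 'thread id' => 'GC' } }
thread_for(gc_sample) # => nil, instead of blowing up on Integer("GC")
```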