[NO-TICKET] Redesign GC profiling to add timeline support and reduce overhead #3313
Commits on Dec 13, 2023
Commit f3da457
Garbage Collection samples no longer collect stacks
When this feature was added, we weren't yet working on allocation profiling, so having stacks would hint at where the allocation that triggered the GC happened. This... is basically a very crappy allocation profiler, and as an anti-bonus it can also get quite expensive. As part of shaping this feature up to be on by default, let's no longer collect stacks. As an alternative, you can get similar context either from allocation profiling OR by looking at the thread timeline for what the thread was doing around the time the GC happened.
Commit 3bca251
Remove record_buffer/extra_frames_in_record_buffer abstraction
These abstractions added some complexity to the stack collector, and were only needed to allow us to put extra placeholder frames on top of a true collected stack. Since this was only being used by the GC placeholder feature, and we've changed that to not collect a stack anymore, we can clean up this mechanism and simplify the code.
Commit 1f5926b
Extract `record_placeholder_stack_in_native_code` again
This function is rather trivial, BUT we have a really nice big comment to explain it, so it's useful to extract it so the comment can live separately, instead of moving it inside the already very complex `sample_thread_internal`.
Commit 6f922aa
While still accurate, I don't think this comment is still relevant -- we no longer have the skid issue, and the rest is just historical context that I don't think is that interesting to keep around.
Commit 7ce8d90
Add timeline event type to stack recorder
This event type will be used to report GC information for use in the profiling timeline.
Commit c376568
This helper takes care of mapping `rb_gc_latest_gc_info` into metadata that gets added to GC samples. This helper will get tested via the thread context specs (changes upcoming).
Commit 14d9789
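As a rough illustration of the kind of mapping the helper performs, here's a hypothetical Ruby-level sketch built on the public `GC.latest_gc_info` API (the Ruby counterpart of the native `rb_gc_latest_gc_info`). The method name `gc_sample_labels` and the exact label names are invented for this example; the real helper lives in native code:

```ruby
# Hypothetical sketch: turn the hash from GC.latest_gc_info into string
# labels that could be attached to a GC sample. The hash keys used here
# (:major_by, :gc_by) match Ruby's documented output; the label names are
# made up for illustration.
def gc_sample_labels
  info = GC.latest_gc_info

  labels = {}
  # :major_by is non-nil only when the last GC was a major GC
  labels['gc type']   = info[:major_by] ? 'major' : 'minor'
  # :gc_by records what triggered the GC (e.g. :newobj, :malloc, :method)
  labels['gc cause']  = info[:gc_by].to_s
  labels['gc reason'] = (info[:major_by] || :none).to_s
  labels
end

GC.start
p gc_sample_labels # prints a small hash, e.g. {"gc type"=>"major", ...}
```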
Commit c64d78a
[NO-TICKET] Redesign GC profiling to add timeline support and reduce overhead

**What does this PR do?**

This PR redesigns the garbage collection profiling feature to add support for timeline and reduce its overhead. A TL;DR description of how it works now is below:

1. Wiring it up
   * Ruby exposes native-level tracepoint events for when it starts/stops doing GC (`RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT`).
   * Inside the `CpuAndWallTimeWorker` we create a tracepoint that listens for these events and calls the `on_gc_start` and `on_gc_finish` events on the `ThreadContext` collector.
   * The `RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT` tracepoint events are emitted while Ruby is doing GC, which means we can't allocate anything. The code we use is thus built to not allocate anything.
   * We can't do all the work we want in `on_gc_finish` because we can't allocate anything, so that function returns a flag that tells the `CpuAndWallTimeWorker` to call `sample_after_gc` as soon as the GC finishes.
2. Tracking
   * When a thread starts doing GC (`on_gc_start`), we record its current cpu-time and wall-time.
   * When a thread stops doing GC (`on_gc_finish`), we record how much cpu-time and wall-time elapsed.
   * When `sample_after_gc` gets called, we turn the data recorded above plus some metadata from Ruby into a sample event that represents GC as a fake "Garbage Collection" thread.
3. Overhead
   * To avoid heavily impacting applications, the Ruby GC works incrementally: rather than doing a full GC (marking, sweeping) in a single step, Ruby breaks this into smaller steps.
   * We get told about every smaller step: we get a pair of `on_gc_start`/`on_gc_finish` calls for each one.
   * In the past, we always called `sample_after_gc` after every `on_gc_finish` to record the GC. This was a big source of overhead: in a sample application that allocates as much as possible in a loop, I was able to observe more than 50k calls to `on_gc_start`/`on_gc_finish` per second, which obviously would represent a lot of overhead if we created one GC sample for each of them.
   * To avoid this overhead, we coalesce/accumulate the cpu/wall-time from multiple calls to `on_gc_start`/`on_gc_finish` and only create a sample to represent this data from time to time (by default, every 10ms or every full GC cycle, whichever happens first).

Compared to the old implementation, here's what changed:

* Garbage collection events get attributed to a placeholder "Garbage Collection" thread.
* We no longer record "Garbage Collection" samples on the thread that triggered GC (and thus no longer include the stack of that thread when recording GC information). This enabled a nice cleanup of the code in the stack collector.
* GCs across several threads can be included in the same sample.
* Timeline support is included.
* Extra metadata about the GC can be seen in the timeline.
* Overhead is lowered, even during heavy periods of GC action.

**Motivation:**

I worked on this as part of the Profiler R&D week. I've been wanting to go back and fix the problems with the GC profiling implementation for a long time :)

**Additional Notes:**

N/A

**How to test the change?**

Test coverage is included. During development, this tiny benchmark was useful:

```ruby
require 'benchmark/ips'
require 'ddtrace'

puts RUBY_DESCRIPTION

Datadog.configure do |c|
  c.profiling.enabled = true
  c.profiling.advanced.allocation_counting_enabled = false
  c.telemetry.enabled = false
end

class Potato
  def initialize(val)
    @potato = val
  end
end

Benchmark.ips do |x|
  x.config(time: 10, warmup: 2)
  x.report("DD_PROFILING_FORCE_ENABLE_GC=#{ENV['DD_PROFILING_FORCE_ENABLE_GC']} Potato.new(10)", "Potato.new(10)")
  x.compare!
end
```
Commit 0ac923a
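The accumulate-then-flush behavior described under "Overhead" can be sketched in plain Ruby. This is a simplified, hypothetical model only: the real implementation is native code that must not allocate during GC, also tracks cpu-time, and reads the full-cycle state from the VM itself. The `GcCoalescer` class and its method signatures are invented for illustration:

```ruby
# Hypothetical model of GC sample coalescing: durations from many
# on_gc_start/on_gc_finish step pairs are accumulated, and a sample is
# only flushed every FLUSH_INTERVAL_NS or when a full GC cycle completes,
# whichever happens first.
class GcCoalescer
  FLUSH_INTERVAL_NS = 10_000_000 # 10ms, the default mentioned above

  attr_reader :samples

  def initialize
    @accumulated_wall_ns = 0
    @gc_started_at = nil
    @last_flush_at = now_ns
    @samples = []
  end

  def on_gc_start
    @gc_started_at = now_ns
  end

  # Returns true when the caller should invoke sample_after_gc (mirroring
  # how the real on_gc_finish signals the CpuAndWallTimeWorker via a flag,
  # since it cannot allocate a sample itself while GC is running).
  def on_gc_finish(full_cycle_finished: false)
    @accumulated_wall_ns += now_ns - @gc_started_at
    @gc_started_at = nil
    full_cycle_finished || (now_ns - @last_flush_at) >= FLUSH_INTERVAL_NS
  end

  def sample_after_gc
    @samples << { wall_time_ns: @accumulated_wall_ns, thread: 'Garbage Collection' }
    @accumulated_wall_ns = 0
    @last_flush_at = now_ns
  end

  private

  def now_ns
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  end
end

coalescer = GcCoalescer.new
5.times do |i|
  coalescer.on_gc_start
  coalescer.sample_after_gc if coalescer.on_gc_finish(full_cycle_finished: i == 4)
end
# Many GC steps within the 10ms window coalesce into a single sample
```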
Remove `SAMPLE_IN_GC` enum entry
Instead, we expose `record_placeholder_stack` directly.
Commit a2c8913
Commit ab39dce
Commit f3a5c97
Tweak code to avoid triggering warnings on the buggy gcc that our tracer-2.7 docker image ships
This allows us to keep using `-Werror` in CI without issue, which is pretty valuable.
Commit b78d772
Because GC samples can be coalesced, we can't assert on 5 GCs resulting in exactly 5 GC events.
Commit d97f4dc
Commit 0127e3a
Commit e8d7b4e
Add explanation for why we add both cpu/wall-time duration as well as timeline duration
Commit c6b791d
Commit 2265fc3
Commit bfdd1b7
It seems that even with `GC.start()` we can observe different reasons, so I've relaxed the spec.
Commit 5f42107
Because GC samples are pushed with `thread id: GC`, this was breaking `object_id_from` in tests that happened to have some GC samples. There is no `object_id` for the "GC thread" because there is no such thread -- it's just a way of representing GC information.
Commit a3bdd74
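The shape of that fix can be sketched as follows. This is a hypothetical stand-in for the real `object_id_from` test helper, with an invented `thread_for` name and sample layout: the point is only that a helper mapping the `thread id` label back to a `Thread` must now skip GC samples, whose label is the literal string `GC` rather than a numeric object id.

```ruby
# Hypothetical sketch: a test helper that resolves a sample's "thread id"
# label back to a Thread. GC samples carry the literal label "GC"; the
# "GC thread" is only a way of representing GC information, so no real
# Thread (and no object id) exists behind it.
def thread_for(sample)
  thread_id = sample.fetch(:labels).fetch('thread id')
  return nil if thread_id == 'GC' # placeholder GC "thread": nothing to look up

  ObjectSpace._id2ref(Integer(thread_id))
end

gc_sample = { labels: { 'thread id' => 'GC' } }
thread_for(gc_sample) # => nil, instead of blowing up on Integer("GC")
```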