Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NO-TICKET] Redesign GC profiling to add timeline support and reduce overhead #3313

Merged
merged 21 commits into from
Dec 14, 2023

Commits on Dec 13, 2023

  1. Configuration menu
    Copy the full SHA
    f3da457 View commit details
    Browse the repository at this point in the history
  2. Garbage Collection samples no longer collect stacks

    When this feature was added, we weren't yet working an allocation
    profiling, so having stacks would hint at where an allocation was that
    triggered the GC.
    
    This... is basically a very crappy allocation profiler, and as
    an anti-bonus it can also get quite expensive.
    
    As part of making this feature shape up for us to have it on by
    default, let's no longer collect stacks. As an alternative, you
    can get a similar context either from allocation profiling OR by
    looking at thread timeline for what the thread was doing around the
    time the GC happened.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    3bca251 View commit details
    Browse the repository at this point in the history
  3. Remove record_buffer/extra_frames_in_record_buffer abstraction

    These abstractions added some complexity to the stack collector, and
    were only needed to allow us to put extra placeholder frames on top
    of a true collected stack.
    
    Since this was only being used by the GC placeholder feature, and
    we've changed that to not collect a stack anymore, we can clean up
    this mechanism and simplify the code.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    1f5926b View commit details
    Browse the repository at this point in the history
  4. Extract record_placeholder_stack_in_native_code again

    This function is rather trivial BUT we have a really nice big comment to
    explain it so it's useful to extract it so we can have the comment
    separate, instead of moving the big comment inside the already
    very-complex `sample_thread_internal`.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    6f922aa View commit details
    Browse the repository at this point in the history
  5. Remove old comment

    While still accurate, I don't think this comment is still relevant --
    we no longer have the skid issue, and the rest is just historical
    context that I don't think is that interesting to keep around.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    7ce8d90 View commit details
    Browse the repository at this point in the history
  6. Add timeline event type to stack recorder

    This event type will be used to report GC information for use in the
    profiling timeline.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    c376568 View commit details
    Browse the repository at this point in the history
  7. Introduce gc profiling helper

    This helper takes care of mapping `rb_gc_latest_gc_info` into metadata
    that gets added to GC samples.
    
    This helper will get tested via the thread context specs (changes
    upcoming).
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    14d9789 View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    c64d78a View commit details
    Browse the repository at this point in the history
  9. [NO-TICKET] Redesign GC profiling to add timeline support and reduce …

    …overhead
    
    **What does this PR do?**
    
    This PR redesigns the garbage collection profiling feature to add support
    for timeline and reduce its overhead.
    
    A TL;DR description of how it works now is below:
    
    1. Wiring it up
    
      * Ruby exposes native-level tracepoint events for when it starts/stops
      doing GC (`RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT`).
      * Inside the `CpuAndWallTimeWorker` we create a tracepoint that listens
      for these events and calls the `on_gc_start` and `on_gc_finish` events
      on the `ThreadContext` collector.
      * The `RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT`
      tracepoint events are emitted while Ruby is doing GC, which means we
      can't allocate anything. The code we use is thus built to not
      allocate anything.
      * We can't do all the work we want in `on_gc_finish` because we can't
      allocate anything, so that function returns a flag that tells the
      `CpuAndWallTimeWorker` to call `sample_after_gc` as soon as the GC
      finishes.
    
    2. Tracking
    
      * When a thread starts doing GC (`on_gc_start`), we record its current
      cpu-time and wall-time.
      * When a thread stops doing GC (`on_gc_finish`), we record how much
      cpu-time and wall-time elapsed.
      * When `sample_after_gc` gets called, we turn the data recorded above
      + some metadata from Ruby into a sample event to represent GC as
      a fake Garbage Collection thread.
    
    3. Overhead
    
      * To avoid impacting applications a lot, the Ruby GC works incrementally.
      That is, rather than doing a full GC (marking, sweeping) in a single
      step, Ruby breaks this into smaller steps.
      * We get told about every smaller step: we get a pair of `on_gc_start`
      /`on_gc_finish` calls for each step.
      * In the past, we always called `sample_after_gc` after every
      `on_gc_finish` to record the GC. This was a big source of overhead:
      in a sample application that allocates as much as possible in a loop
      I was able to observe more than 50k calls to `on_gc_start`/`on_gc_finish`
      per second, which obviously would represent a lot of overhead if we
      create one GC sample for each of these.
      * To avoid this overhead, we coalesce/accumulate the cpu/wall-time from
      multiple calls to `on_gc_start`/`on_gc_finish` and only create a sample
      to represent this data from time to time (default is every 10ms or
      every full GC cycle, whatever happens first).
    
    Compared to the old implementation, here's what changed:
    
    * Garbage collection events get attributed to a placeholder
    "Garbage Collection" thread.
    * We no longer record "Garbage Collection" samples on the thread that
    triggered GC. (And thus no longer include the stack of that thread
    when recording GC information). This enabled a nice cleanup of the
    code in the stack collector
    * GCs across several threads can be included in the same sample
    * Timeline support is included
    * Extra metadata about the GC can be seen in the timeline
    * Overhead is lowered, even during heavy periods of GC action
    
    **Motivation:**
    
    I worked on this as part of the Profiler R&D week. I've been wanting
    to go back and fix the problems with the gc profiling implementation for
    a long time :)
    
    **Additional Notes:**
    
    N/A
    
    **How to test the change?**
    
    Test coverage is included.
    
    During development, this tiny benchmark was useful:
    
    ```ruby
    require 'benchmark/ips'
    require 'ddtrace'
    
    puts RUBY_DESCRIPTION
    
    Datadog.configure do |c|
      c.profiling.enabled = true
      c.profiling.advanced.allocation_counting_enabled = false
      c.telemetry.enabled = false
    end
    
    class Potato
      def initialize(val)
        @potato = val
      end
    end
    
    Benchmark.ips do |x|
      x.config(time: 10, warmup: 2)
    
      x.report("DD_PROFILING_FORCE_ENABLE_GC=#{ENV['DD_PROFILING_FORCE_ENABLE_GC']} Potato.new(10)", "Potato.new(10)")
    
      x.compare!
    end
    ```
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    0ac923a View commit details
    Browse the repository at this point in the history
  10. Remove SAMPLE_IN_GC enum entry

    Instead, we expose the `record_placeholder_stack` directly.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    a2c8913 View commit details
    Browse the repository at this point in the history
  11. Remove sample_type enum

    No longer needed!
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    ab39dce View commit details
    Browse the repository at this point in the history
  12. Fix rubocop nitpick

    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    f3a5c97 View commit details
    Browse the repository at this point in the history
  13. Tweak code to avoid triggering warnings on the buggy gcc that our tra…

    …cer-2.7 docker image ships
    
    This allows us to keep using `-Werror` in CI without issue, which is
    pretty valuable.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    b78d772 View commit details
    Browse the repository at this point in the history
  14. Tweak spec to avoid flakiness

    Because GC samples can be coalesced, we can't assert on 5 GCs resulting
    in exactly 5 GC events.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    d97f4dc View commit details
    Browse the repository at this point in the history
  15. Update outdated comment

    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    0127e3a View commit details
    Browse the repository at this point in the history
  16. Configuration menu
    Copy the full SHA
    e8d7b4e View commit details
    Browse the repository at this point in the history
  17. Configuration menu
    Copy the full SHA
    c6b791d View commit details
    Browse the repository at this point in the history
  18. Configuration menu
    Copy the full SHA
    2265fc3 View commit details
    Browse the repository at this point in the history
  19. Configuration menu
    Copy the full SHA
    bfdd1b7 View commit details
    Browse the repository at this point in the history
  20. Fix spec flakyness

    It seems that even with `GC.start()` we can observe different reasons,
    so I've relaxed the spec.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    5f42107 View commit details
    Browse the repository at this point in the history
  21. Fix another flaky spec

    Because GC samples are pushed with `thread id: GC`, this was breaking
    `object_id_from` in tests that happened to have some GC samples.
    
    There is no `object_id` for the "GC thread" because there is no such
    thread -- it's just a way of representing GC information.
    ivoanjo committed Dec 13, 2023
    Configuration menu
    Copy the full SHA
    a3bdd74 View commit details
    Browse the repository at this point in the history