[PROF-3321] Fix CPU-time accounting when fibers are used #1636

Merged
merged 2 commits into master from ivoanjo/fix-cpu-time-accounting-with-fibers
Aug 9, 2021

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Aug 6, 2021

The Ruby `Thread` API has a really sharp edge around `Thread#[]` and `Thread#[]=`: these methods are "not thread-local but fiber-local".
The documentation even goes as far as calling this "a confusion", see <https://rubyapi.org/3.0/o/thread#method-i-5B-5D>.

Our CPU-time tracking was suffering from this confusion. Because it was using `Thread#[]`/`Thread#[]=`, if a thread switched between fibers, our CPU-time tracking started miscounting time.
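
For illustration, here is a minimal snippet (not from this PR, only core Ruby) showing the two behaviors side by side when the same thread runs multiple fibers:

```ruby
# Thread#[]= stores the value in the *current fiber's* storage, while
# Thread#thread_variable_set stores it on the thread itself.
t = Thread.current
t[:key] = "set from the main fiber"
t.thread_variable_set(:key, "set from the main fiber")

Fiber.new do
  p t[:key]                      # => nil ("fiber-local": this fiber has its own copy)
  p t.thread_variable_get(:key)  # => "set from the main fiber" (shared across fibers)
end.resume
```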

E.g., consider the following test application:

```ruby
def fake_cpu_time(n)
  Thread.pass
  sleep 0.01
  Thread.pass
  puts "Fiber #{n} yielding..."
  Fiber.yield :frozen
  puts "Fiber #{n} resumed..."
  Thread.pass
  sleep 0.005
  Thread.pass
  sleep 0.005
  Thread.pass
end

def start_fibers
  fibers = (0..1000).map do |n|
    Fiber.new do
      fake_cpu_time(n)
    end
  end
  run_fibers(fibers)
  fibers
end

def actually_do_cpu_work
  puts "Starting cpu work"
  start = Time.now
  nil while Time.now < (start + 28)
  puts "Finished cpu work..."
end

def run_fibers(the_fibers)
  the_fibers.each(&:resume)
end

the_fibers = start_fibers
actually_do_cpu_work
run_fibers(the_fibers)
```

The following happens in this example:

  1. It starts 1000 fibers, using `Thread.pass`+`sleep` to try to give the profiler a chance to run and sample them.
  2. Because `Thread#[]=` is fiber-local, every fiber that is sampled gets its own copy of `THREAD_LAST_CPU_TIME_KEY`.
  3. Then the example switches back to the original fiber, and does some CPU-heavy work for around 28 seconds.
  4. The example app then runs through every fiber again, using `Thread.pass`+`sleep` to give the profiler a chance to sample every fiber.
  5. While sampling, for every fiber the profiler observes that the current CPU time has advanced 28 seconds from the last value stored in that fiber's copy of `THREAD_LAST_CPU_TIME_KEY` (see the sketch after this list).
  6. The profiler then reports that the same thread spent hours of CPU time in a 1-minute period. (In my experiment, it reported close to 2 hours, 40 minutes.)
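
A simplified sketch of the delta computation described in step 5 (the method name and the constant's value are illustrative, not the actual collector code from this PR):

```ruby
# Hypothetical key name; the real constant lives in the profiler code.
THREAD_LAST_CPU_TIME_KEY = :datadog_profiler_last_cpu_time

# With Thread#[] / Thread#[]=, the "last CPU time" read and written here lives
# in the storage of whichever fiber the sampled thread is currently running.
# A fiber last sampled before the 28s of CPU work therefore sees a ~28s delta,
# and ~1000 fibers each contribute that delta to the same thread.
def cpu_time_delta_for(thread, current_cpu_time_ns)
  last = thread[THREAD_LAST_CPU_TIME_KEY]
  thread[THREAD_LAST_CPU_TIME_KEY] = current_cpu_time_ns
  last ? current_cpu_time_ns - last : 0
end
```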

The fix here is simple: switch to the `Thread#thread_variable_get` and `Thread#thread_variable_set` APIs, which give us the intended behavior: even across multiple fibers, we keep a single value for the same thread, and thus we get the expected ~28s of CPU time in the test application.

Since this method is called at stack sampler startup, threads that previously were not touched would get written to anyway on the next sample.

The one exception/behavior change is if CPU time profiling is disabled or unavailable for some reason -- we would previously never set `THREAD_LAST_CPU_TIME_KEY` for threads in that case, but I don't think it's worth having this distinction, so I'll take the simpler implementation and tests instead.
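
A sketch of what the fixed accessors look like with the thread-local APIs (method names are illustrative and reuse the hypothetical constant from the sketch above, not the exact code from this PR):

```ruby
# thread_variable_get/set keep a single value per thread, shared across all of
# that thread's fibers, which is the behavior the CPU-time accounting needs.
def last_cpu_time_for(thread)
  thread.thread_variable_get(THREAD_LAST_CPU_TIME_KEY)
end

def store_last_cpu_time(thread, cpu_time_ns)
  thread.thread_variable_set(THREAD_LAST_CPU_TIME_KEY, cpu_time_ns)
end
```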
@ivoanjo ivoanjo requested a review from a team August 6, 2021 10:39
@ivoanjo
Member Author

ivoanjo commented Aug 6, 2021

Alternative title for this PR: This One Ruby Trick Will Improve Your Cpu Usage By 200x:

[Screenshot: Screen Shot 2021-08-06 at 11 42 40]

@codecov-commenter

Codecov Report

Merging #1636 (c5cbf12) into master (7859491) will decrease coverage by 0.00%.
The diff coverage is 100.00%.


```
@@            Coverage Diff             @@
##           master    #1636      +/-   ##
==========================================
- Coverage   98.30%   98.30%   -0.01%     
==========================================
  Files         901      901              
  Lines       43541    43532       -9     
==========================================
- Hits        42805    42796       -9     
  Misses        736      736              
```
| Impacted Files | Coverage Δ |
|---|---|
| lib/ddtrace/profiling/collectors/stack.rb | 98.92% <100.00%> (ø) |
| spec/ddtrace/profiling/collectors/stack_spec.rb | 99.05% <100.00%> (-0.03%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@marcotc marcotc left a comment

Nice catch, we'll likely have to do a similar analysis for the tracer as well.

This analysis is on our radar, especially because newer releases of graphql use fibers, and we know there's an impact with the tracer's current implementation. (Doesn't mean we can just flip it to `thread_variable_*` without due diligence 😅)

@marcotc marcotc added the profiling Involves Datadog profiling label Aug 6, 2021
@ivoanjo
Member Author

ivoanjo commented Aug 9, 2021

Yeah, for the tracer it's non-trivial, as we may actually want a mix of fiber-local and thread-local. For instance, "current span" probably makes sense to be fiber-local -- if I start a span in a fiber and then switch fibers, I should probably not still be inside the same span.
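
As a purely illustrative example of that distinction (hypothetical storage keys, not ddtrace's actual tracer internals):

```ruby
# If "current span" were stored thread-locally, switching fibers would leave
# the new fiber inside a span it never started.
Thread.current.thread_variable_set(:current_span, "span started by fiber A")
Thread.current[:current_span] = "span started by fiber A"

Fiber.new do
  p Thread.current.thread_variable_get(:current_span) # => "span started by fiber A" (leaks across fibers)
  p Thread.current[:current_span]                      # => nil (each fiber starts with no current span)
end.resume
```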

@ivoanjo
Member Author

ivoanjo commented Aug 9, 2021

It occurred to me that, while this is a really simple change, the resulting performance impact may be non-trivial, so I wrote up a quick benchmark, which I ran on ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-darwin19]:

```ruby
require 'benchmark/ips'

OTHER_THREAD = Thread.new { sleep }

Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 2)
  x.report("fiber-local-set") { OTHER_THREAD[:foo] = 1 }
  x.report("thread-local-set") { OTHER_THREAD.thread_variable_set(:foo, 1) }
  x.compare!
end

Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 2)
  x.report("fiber-local-get-non-existing") { OTHER_THREAD[:non_existing] }
  x.report("thread-local-get-non-existing") { OTHER_THREAD.thread_variable_get(:non_existing) }
  x.compare!
end

Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 2)
  x.report("fiber-local-get-existing") { OTHER_THREAD[:foo] + 1 }
  x.report("thread-local-get-existing") { OTHER_THREAD.thread_variable_get(:foo) + 1 }
  x.compare!
end
```

and indeed there is a performance difference! `Thread#thread_variable_get` and `Thread#thread_variable_set` are slower than `Thread#[]`/`Thread#[]=`:

```
Warming up --------------------------------------
     fiber-local-set     1.460M i/100ms
    thread-local-set     1.112M i/100ms
Calculating -------------------------------------
     fiber-local-set     14.469M (± 2.2%) i/s -     72.988M in   5.046837s
    thread-local-set     11.189M (± 2.0%) i/s -     56.689M in   5.068468s

Comparison:
     fiber-local-set: 14468800.7 i/s
    thread-local-set: 11189073.2 i/s - 1.29x  (± 0.00) slower

Warming up --------------------------------------
fiber-local-get-non-existing
                         1.661M i/100ms
thread-local-get-non-existing
                         1.265M i/100ms
Calculating -------------------------------------
fiber-local-get-non-existing
                         16.602M (± 2.1%) i/s -     83.032M in   5.003403s
thread-local-get-non-existing
                         12.452M (± 1.5%) i/s -     63.236M in   5.079357s

Comparison:
fiber-local-get-non-existing: 16602058.9 i/s
thread-local-get-non-existing: 12452358.3 i/s - 1.33x  (± 0.00) slower

Warming up --------------------------------------
fiber-local-get-existing
                         1.556M i/100ms
thread-local-get-existing
                         1.258M i/100ms
Calculating -------------------------------------
fiber-local-get-existing
                         15.816M (± 1.2%) i/s -     79.372M in   5.018978s
thread-local-get-existing
                         12.673M (± 2.3%) i/s -     64.154M in   5.064612s

Comparison:
fiber-local-get-existing: 15816477.8 i/s
thread-local-get-existing: 12673473.5 i/s - 1.25x  (± 0.00) slower
```

On the other hand, we're still talking many millions of operations per second, so I'm confident this will not have a noticeable impact (we do way more expensive things in the sampling loop).

@ivoanjo ivoanjo merged commit e9d6ffc into master Aug 9, 2021
@ivoanjo ivoanjo deleted the ivoanjo/fix-cpu-time-accounting-with-fibers branch August 9, 2021 09:53
@github-actions github-actions bot added this to the 0.52.0 milestone Aug 9, 2021