[PROF-3321] Fix CPU-time accounting when fibers are used #1636

Merged
merged 2 commits into master from ivoanjo/fix-cpu-time-accounting-with-fibers
Aug 9, 2021

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Aug 6, 2021

The Ruby `Thread` API has a really sharp edge around `Thread#[]` and `Thread#[]=`: these methods are "not thread-local but fiber-local".
The documentation even goes as far as calling this "a confusion", see <https://rubyapi.org/3.0/o/thread#method-i-5B-5D>.

Our CPU-time tracking was suffering from this confusion. Because it was using `Thread#[]`/`Thread#[]=`, if a thread switched between fibers, our CPU-time tracking started miscounting time.
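
For illustration, here is a minimal snippet (not from this PR, only core Ruby) showing the two behaviors side by side when the same thread runs multiple fibers:

```ruby
# Thread#[]= stores the value in the *current fiber's* storage, while
# Thread#thread_variable_set stores it on the thread itself.
t = Thread.current
t[:key] = "set from the main fiber"
t.thread_variable_set(:key, "set from the main fiber")

Fiber.new do
  p t[:key]                      # => nil ("fiber-local": this fiber has its own copy)
  p t.thread_variable_get(:key)  # => "set from the main fiber" (shared across fibers)
end.resume
```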

E.g., consider the following test application:

```ruby
def fake_cpu_time(n)
  Thread.pass
  sleep 0.01
  Thread.pass
  puts "Fiber #{n} yielding..."
  Fiber.yield :frozen
  puts "Fiber #{n} resumed..."
  Thread.pass
  sleep 0.005
  Thread.pass
  sleep 0.005
  Thread.pass
end

def start_fibers
  fibers = (0..1000).map do |n|
    Fiber.new do
      fake_cpu_time(n)
    end
  end
  run_fibers(fibers)
  fibers
end

def actually_do_cpu_work
  puts "Starting cpu work"
  start = Time.now
  nil while Time.now < (start + 28)
  puts "Finished cpu work..."
end

def run_fibers(the_fibers)
  the_fibers.each(&:resume)
end

the_fibers = start_fibers
actually_do_cpu_work
run_fibers(the_fibers)
```

The following happens in this example:

  1. It starts 1000 fibers, using `Thread.pass`+`sleep` to try to give the profiler a chance to run and sample them.
  2. Because `Thread#[]=` is fiber-local, every fiber that is sampled gets its own copy of `THREAD_LAST_CPU_TIME_KEY`.
  3. Then the example switches back to the original fiber, and does some CPU-heavy work for around 28 seconds.
  4. The example app then runs through every fiber again, using `Thread.pass`+`sleep` to give the profiler a chance to sample every fiber.
  5. While sampling, for every fiber the profiler observes that the current CPU time has advanced 28 seconds from the last value stored in that fiber's copy of `THREAD_LAST_CPU_TIME_KEY` (see the sketch after this list).
  6. The profiler then reports that the same thread spent hours of CPU time in a 1-minute period. (In my experiment, it reported close to 2 hours, 40 minutes.)
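
A simplified sketch of the delta computation described in step 5 (the method name and the constant's value are illustrative, not the actual collector code from this PR):

```ruby
# Hypothetical key name; the real constant lives in the profiler code.
THREAD_LAST_CPU_TIME_KEY = :datadog_profiler_last_cpu_time

# With Thread#[] / Thread#[]=, the "last CPU time" read and written here lives
# in the storage of whichever fiber the sampled thread is currently running.
# A fiber last sampled before the 28s of CPU work therefore sees a ~28s delta,
# and ~1000 fibers each contribute that delta to the same thread.
def cpu_time_delta_for(thread, current_cpu_time_ns)
  last = thread[THREAD_LAST_CPU_TIME_KEY]
  thread[THREAD_LAST_CPU_TIME_KEY] = current_cpu_time_ns
  last ? current_cpu_time_ns - last : 0
end
```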

The fix here is simple: switch to the `Thread#thread_variable_get` and `Thread#thread_variable_set` APIs, which give us the intended behavior: even across multiple fibers, we keep a single value for the same thread, and thus we get the expected ~28s of CPU time in the test application.

Since this method is called at stack sampler startup, threads that previously were not touched would get written to anyway on the next sample.

The one exception/behavior change is if CPU time profiling is disabled or unavailable for some reason -- we would previously never set `THREAD_LAST_CPU_TIME_KEY` for threads in that case, but I don't think it's worth having this distinction, so I'll take the simpler implementation and tests instead.
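
A sketch of what the fixed accessors look like with the thread-local APIs (method names are illustrative and reuse the hypothetical constant from the sketch above, not the exact code from this PR):

```ruby
# thread_variable_get/set keep a single value per thread, shared across all of
# that thread's fibers, which is the behavior the CPU-time accounting needs.
def last_cpu_time_for(thread)
  thread.thread_variable_get(THREAD_LAST_CPU_TIME_KEY)
end

def store_last_cpu_time(thread, cpu_time_ns)
  thread.thread_variable_set(THREAD_LAST_CPU_TIME_KEY, cpu_time_ns)
end
```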
@ivoanjo ivoanjo requested a review from a team August 6, 2021 10:39
@ivoanjo
Member Author

ivoanjo commented Aug 6, 2021

Alternative title for this PR: This One Ruby Trick Will Improve Your Cpu Usage By 200x:

[Screenshot: Screen Shot 2021-08-06 at 11 42 40]

@codecov-commenter

Codecov Report

Merging #1636 (c5cbf12) into master (7859491) will decrease coverage by 0.00%.
The diff coverage is 100.00%.


```
@@            Coverage Diff             @@
##           master    #1636      +/-   ##
==========================================
- Coverage   98.30%   98.30%   -0.01%     
==========================================
  Files         901      901              
  Lines       43541    43532       -9     
==========================================
- Hits        42805    42796       -9     
  Misses        736      736              
```
| Impacted Files | Coverage Δ |
|---|---|
| lib/ddtrace/profiling/collectors/stack.rb | 98.92% <100.00%> (ø) |
| spec/ddtrace/profiling/collectors/stack_spec.rb | 99.05% <100.00%> (-0.03%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@marcotc marcotc left a comment

Nice catch, we'll likely have to do a similar analysis for the tracer as well.

This analysis is on our radar, especially because newer releases of graphql use fibers, and we know there's an impact with the tracer's current implementation. (Doesn't mean we can just flip it to `thread_variable_*` without due diligence 😅)

@marcotc marcotc added the profiling Involves Datadog profiling label Aug 6, 2021
@ivoanjo
Member Author

ivoanjo commented Aug 9, 2021

Yeah, for the tracer it's non-trivial, as we may actually want a mix of fiber-local and thread-local. For instance, "current span" probably makes sense to be fiber-local -- if I start a span in a fiber and then switch fibers, I should probably not still be inside the same span.
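
As a purely illustrative example of that distinction (hypothetical storage keys, not ddtrace's actual tracer internals):

```ruby
# If "current span" were stored thread-locally, switching fibers would leave
# the new fiber inside a span it never started.
Thread.current.thread_variable_set(:current_span, "span started by fiber A")
Thread.current[:current_span] = "span started by fiber A"

Fiber.new do
  p Thread.current.thread_variable_get(:current_span) # => "span started by fiber A" (leaks across fibers)
  p Thread.current[:current_span]                      # => nil (each fiber starts with no current span)
end.resume
```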

@ivoanjo
Member Author

ivoanjo commented Aug 9, 2021

It occurred to me that, while this is a really simple change, the resulting performance impact may be non-trivial, so I wrote up a quick benchmark, which I ran on ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-darwin19]:

```ruby
require 'benchmark/ips'

OTHER_THREAD = Thread.new { sleep }

Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 2)
  x.report("fiber-local-set") { OTHER_THREAD[:foo] = 1 }
  x.report("thread-local-set") { OTHER_THREAD.thread_variable_set(:foo, 1) }
  x.compare!
end

Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 2)
  x.report("fiber-local-get-non-existing") { OTHER_THREAD[:non_existing] }
  x.report("thread-local-get-non-existing") { OTHER_THREAD.thread_variable_get(:non_existing) }
  x.compare!
end

Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 2)
  x.report("fiber-local-get-existing") { OTHER_THREAD[:foo] + 1 }
  x.report("thread-local-get-existing") { OTHER_THREAD.thread_variable_get(:foo) + 1 }
  x.compare!
end
```

and indeed there is a performance difference! `Thread#thread_variable_get` and `Thread#thread_variable_set` are slower than `Thread#[]`/`Thread#[]=`:

```
Warming up --------------------------------------
     fiber-local-set     1.460M i/100ms
    thread-local-set     1.112M i/100ms
Calculating -------------------------------------
     fiber-local-set     14.469M (± 2.2%) i/s -     72.988M in   5.046837s
    thread-local-set     11.189M (± 2.0%) i/s -     56.689M in   5.068468s

Comparison:
     fiber-local-set: 14468800.7 i/s
    thread-local-set: 11189073.2 i/s - 1.29x  (± 0.00) slower

Warming up --------------------------------------
fiber-local-get-non-existing
                         1.661M i/100ms
thread-local-get-non-existing
                         1.265M i/100ms
Calculating -------------------------------------
fiber-local-get-non-existing
                         16.602M (± 2.1%) i/s -     83.032M in   5.003403s
thread-local-get-non-existing
                         12.452M (± 1.5%) i/s -     63.236M in   5.079357s

Comparison:
fiber-local-get-non-existing: 16602058.9 i/s
thread-local-get-non-existing: 12452358.3 i/s - 1.33x  (± 0.00) slower

Warming up --------------------------------------
fiber-local-get-existing
                         1.556M i/100ms
thread-local-get-existing
                         1.258M i/100ms
Calculating -------------------------------------
fiber-local-get-existing
                         15.816M (± 1.2%) i/s -     79.372M in   5.018978s
thread-local-get-existing
                         12.673M (± 2.3%) i/s -     64.154M in   5.064612s

Comparison:
fiber-local-get-existing: 15816477.8 i/s
thread-local-get-existing: 12673473.5 i/s - 1.25x  (± 0.00) slower
```

On the other hand, we're still talking many millions of operations per second, so I'm confident this will not have a noticeable impact (we do way more expensive things in the sampling loop).

@ivoanjo ivoanjo merged commit e9d6ffc into master Aug 9, 2021
@ivoanjo ivoanjo deleted the ivoanjo/fix-cpu-time-accounting-with-fibers branch August 9, 2021 09:53
@github-actions github-actions bot added this to the 0.52.0 milestone Aug 9, 2021