Profiler pprof encoding benchmark and first improvements #1511

Merged: 2 commits from ivoanjo/pprof-encoding-benchmark-and-improvements into master on May 21, 2021

Conversation

@ivoanjo (Member) commented May 20, 2021

While benchmarking Discourse with the profiler, I noticed a pattern of increased latency in the app whenever a request happened concurrently with pprof exporting.

A quick look with a profiler revealed that there seems to be some low-hanging fruit in our exporting process. I've addressed the lowest of the low-hanging fruit and already got a nice improvement:

```
   exporter baseline      2.699  (± 0.0%) i/s -     27.000  in  10.026990s
exporter cached-build-location-v2
                          4.356  (± 0.0%) i/s -     44.000  in  10.114602s

Comparison:
exporter cached-build-location-v2:        4.4 i/s
   exporter baseline:        2.7 i/s - 1.61x  (± 0.00) slower
```

I have a few more leads/ideas for improvements, but I'll submit them separately.
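
For a rough idea of what a change like this can boil down to, here's a minimal sketch of memoizing pprof Location construction, judging from the branch name and the touched file (`lib/ddtrace/profiling/pprof/builder.rb`). This is not the actual implementation, and `build_location`/`location_for` are hypothetical names:

```ruby
# Hypothetical sketch: cache built pprof Location objects keyed by the
# (path, line, function) triple, so frames that repeat across samples
# are built once and then reused instead of being rebuilt every export.
def location_for(path, line, function)
  @location_cache ||= {}
  @location_cache[[path, line, function]] ||= build_location(path, line, function)
end
```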

While benchmarking Discourse with the profiler, I noticed a pattern of
increased latency in the app whenever a request happened concurrently
with pprof exporting.

A quick look with a profiler revealed that there seems to be some
low-hanging fruit in our exporting process, so I've added this
benchmark to allow me to iterate on improvements to this part of the
code.
This trivial change yields a rather nice improvement to performance
in the `benchmarks/profiler_submission.rb` benchmark
(Ruby 2.7.3/macOS 10.15.7/i7-1068NG7):

```
   exporter baseline      2.699  (± 0.0%) i/s -     27.000  in  10.026990s
exporter cached-build-location-v2
                          4.356  (± 0.0%) i/s -     44.000  in  10.114602s

Comparison:
exporter cached-build-location-v2:        4.4 i/s
   exporter baseline:        2.7 i/s - 1.61x  (± 0.00) slower
```
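
The output above is in the format produced by the benchmark-ips gem; a harness along these lines generates that kind of report. Here `export_baseline`/`export_cached` are hypothetical stand-ins for the two exporter variants (in practice each label came from a run on its own branch):

```ruby
require 'benchmark/ips'

Benchmark.ips do |x|
  # ~10s measurement windows, matching the ~10s runs reported above
  x.config(time: 10, warmup: 2)

  x.report('exporter baseline')                  { export_baseline }
  x.report('exporter cached-build-location-v2')  { export_cached }

  # Prints the "Comparison:" section, including the 1.61x slowdown factor
  x.compare!
end
```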
@ivoanjo ivoanjo requested a review from a team May 20, 2021 11:57
@codecov-commenter

Codecov Report

Merging #1511 (bb292d1) into master (88d321b) will increase coverage by 0.00%.
The diff coverage is 100.00%.


```
@@           Coverage Diff           @@
##           master    #1511   +/-   ##
=======================================
  Coverage   98.22%   98.22%
=======================================
  Files         862      862
  Lines       41686    41688    +2
=======================================
+ Hits        40945    40947    +2
  Misses        741      741
```
| Impacted Files | Coverage Δ |
|----------------|------------|
| lib/ddtrace/profiling/pprof/builder.rb | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 88d321b...bb292d1.

@marcotc (Member) left a comment

🏎️

@ivoanjo ivoanjo merged commit 09ed439 into master May 21, 2021
@ivoanjo ivoanjo deleted the ivoanjo/pprof-encoding-benchmark-and-improvements branch May 21, 2021 07:43
@github-actions github-actions bot added this to the 0.50.0 milestone May 21, 2021
ivoanjo added a commit that referenced this pull request Sep 24, 2021
Prior to this change, the profiler tried to sample all threads each
time it decided to sample.

This could have a negative impact on the latency of Ruby services with
a lot of threads, because each sampling pass would be quite expensive.

Just as an illustration, consider the
`benchmarks/profiler_sample_loop` benchmark, which tries to sample
4 threads with very deep (500+ frames) stacks as fast as possible,
along the lines of the sketch below.
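
A minimal sketch of that kind of benchmark, using `Thread#backtrace` as a stand-in for the profiler's stack collector (this is not the actual benchmark file):

```ruby
require 'benchmark/ips'

# Park threads on ~500-frame-deep stacks via recursion, then measure how
# many full sampling passes per second we can do over all of them.
def deep_sleep(depth = 500)
  depth.zero? ? sleep : deep_sleep(depth - 1)
end

threads = Array.new(4) { Thread.new { deep_sleep } }

Benchmark.ips do |x|
  x.report('stack collector') { threads.each(&:backtrace) }
end
```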

As the number of threads to sample increases, sampling performance
degrades:

| Sampled threads | Iterations per second                               |
|-----------------|-----------------------------------------------------|
| 4 threads       | 390.921  (± 6.1%) i/s -      3.920k in  10.075624s  |
| 16 threads      |  97.037  (± 6.2%) i/s -    972.000  in  10.072298s  |
| 32 threads      |  49.576  (± 4.0%) i/s -    496.000  in  10.038491s  |
| 64 threads      |  24.060  (± 8.3%) i/s -    240.000  in  10.033481s  |

(All numbers from my laptop: i7-1068NG7 + ruby 2.7.4p191 [x86_64-darwin19])

When combined with our dynamic sampling mechanism, this means that if
the profiler starts to take longer to sample, it just samples less often.
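
As a sketch of how such a back-off behaves (the names, constants, and `sample_all_threads` are hypothetical, not the library's actual code):

```ruby
# Hypothetical sketch: stretch the sleep between sampling passes so that
# sampling stays below a target fraction of wall time. Expensive passes
# therefore automatically lead to less frequent sampling.
TARGET_OVERHEAD = 0.02 # assumed: spend at most ~2% of wall time sampling
BASE_INTERVAL = 0.01   # assumed: aim for one pass every 10ms

loop do
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  sample_all_threads # hypothetical stand-in for a sampling pass
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start

  # e.g. a 40ms pass delays the next pass by at least 40ms / 0.02 = 2s
  sleep [BASE_INTERVAL, elapsed / TARGET_OVERHEAD].max
end
```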

But this means it can impact service latency, because we may suddenly
be spending 40ms+ of CPU time on a single sampling pass (the 64-thread,
500-frame example above, a very extreme case), which from experience
(#1511, #1522) is not good.

Instead, this PR changes the stack sampler to sample up to a set
maximum of threads (16), randomly selected among all threads.
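
In Ruby, picking such a capped random subset is a one-liner; a minimal sketch (the constant and helper names are illustrative, not the exact implementation):

```ruby
MAX_THREADS_SAMPLED = 16 # illustrative name for the cap described above

def threads_to_sample
  threads = Thread.list
  return threads if threads.size <= MAX_THREADS_SAMPLED

  # Array#sample(n) returns n distinct elements chosen at random
  threads.sample(MAX_THREADS_SAMPLED)
end
```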

This means that adding more threads to the system causes very little
degradation:

| Sampled threads | Iterations per second                              |
|-----------------|----------------------------------------------------|
| 16 threads      | 104.859  (± 6.7%) i/s -      1.050k in  10.068781s |
| 32 threads      | 100.957  (± 5.9%) i/s -      1.010k in  10.047250s |
| 64 threads      |  98.098  (± 5.1%) i/s -    981.000  in  10.038244s |
| 256 threads     |  84.740  (± 8.3%) i/s -    848.000  in  10.074037s |

There's still a bit of degradation that I suspect is from pure VM
overhead -- 256+ threads on a single Ruby VM adds up.

Because we pick the threads to sample randomly, we'll eventually sample
all threads -- just not all at once. (For instance, with 64 threads and
a cap of 16, any given thread is included in each pass with probability
16/64 = 1/4, so over many passes every thread still gets sampled
regularly.)

Finally, regarding the dynamic sampling mechanism: because sampling
passes no longer get more expensive, the profiler will sample more
often, and thus over a longer period we should take roughly the same
samples.

One downside of this approach is that if there really are many threads,
the resulting wall clock times in a one-minute profile may "drift"
around the 60-second mark. For instance, if a thread happened to be
sampled only once per second and only 59 times, we'll report 59s, but
the next report will include the missing sample and then come out at
61s.
I've observed 60 ± 1.68s for an app with ~65 threads, given the default
maximum of 16 threads.
This seems a reasonable enough margin of error given the improvement to
latency (especially on such a large application, where trying to sample
all threads would have an even bigger latency impact).

If it helps, I've used this trivial Ruby app for testing with many
threads:

```ruby
# Spawns 32 threads that block forever inside uniquely-named methods
# (sleeper0..sleeper31, defined via eval), so each thread shows up with
# a distinct frame in the profile.
32.times do |n|
  eval("
    def sleeper#{n}
      sleep
    end
  ")

  Thread.new { send(:"sleeper#{n}") }
end

sleep # keep the main thread alive
```