Profiler pprof encoding benchmark and first improvements #1511
Conversation
While benchmarking Discourse with the profiler, I noticed a pattern of increased latency in the app whenever a request happened concurrently with pprof exporting. A quick look with a profiler revealed what seems to be some low-hanging fruit in our exporting process, so I've added this benchmark to let me iterate on improvements to this part of the code.
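For context, the results shown below are in benchmark-ips format. A minimal harness in that style might look roughly like the following sketch; `run_pprof_export` is a hypothetical stand-in for the real export call, not the actual code in `benchmarks/profiler_submission.rb`:

```ruby
require 'benchmark/ips'

# Hypothetical stand-in for the real export call; the actual benchmark lives
# in benchmarks/profiler_submission.rb and exercises the pprof exporter.
def run_pprof_export
  # ... build a recording with many samples and encode it to pprof ...
end

Benchmark.ips do |x|
  # ~10 second measurement windows, matching the timings in the results below.
  x.config(time: 10, warmup: 2)

  x.report('exporter') { run_pprof_export }

  # benchmark-ips can persist results across runs (x.save! / x.hold!), which is
  # how a "baseline" run and a "cached-build-location-v2" run on different
  # branches can be compared in a single report.
  x.compare!
end
```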
This trivial change yields a rather nice improvement to performance in the `benchmarks/profiler_submission.rb` benchmark (Ruby 2.7.3 / macOS 10.15.7 / i7-1068NG7):

```
exporter baseline                      2.699 (± 0.0%) i/s -  27.000 in  10.026990s
exporter cached-build-location-v2      4.356 (± 0.0%) i/s -  44.000 in  10.114602s

Comparison:
exporter cached-build-location-v2:      4.4 i/s
exporter baseline:                      2.7 i/s - 1.61x (± 0.00) slower
```
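For illustration only: the "cached-build-location-v2" label suggests the improvement memoizes some repeated work during encoding. The actual diff isn't reproduced here; the sketch below just shows the general caching pattern, with entirely made-up names:

```ruby
# Generic illustration of the caching pattern only; these names are made up
# and this is not the actual change in this PR.
class LocationEncoder
  def initialize
    @cache = {} # memoizes entries keyed by (path, line)
  end

  def encode(path, line)
    # Stacks repeat heavily across samples, so rebuilding the same entry on
    # every sample is wasted work; build it once and reuse it.
    @cache[[path, line]] ||= build_entry(path, line)
  end

  private

  def build_entry(path, line)
    { path: path, line: line } # stand-in for the real pprof location message
  end
end
```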
Codecov Report
```diff
@@           Coverage Diff            @@
##           master    #1511    +/-   ##
========================================
  Coverage   98.22%   98.22%
========================================
  Files         862      862
  Lines       41686    41688     +2
========================================
+ Hits        40945    40947     +2
  Misses        741      741
```
Continue to review full report at Codecov.
🏎️
Prior to this change, the profiler tried to sample all threads each time it decided to sample. This could have a negative impact on the latency of Ruby services with a lot of threads, because each sampling pass would be quite expensive.

As an illustration, consider the `benchmarks/profiler_sample_loop` benchmark, which tries to sample, as fast as possible, 4 threads with very deep (500+ frame) stacks. As the number of threads to sample increases, the performance of sampling degrades:

| Sampled threads | Iterations per second |
|-----------------|-----------------------|
| 4 threads | 390.921 (± 6.1%) i/s - 3.920k in 10.075624s |
| 16 threads | 97.037 (± 6.2%) i/s - 972.000 in 10.072298s |
| 32 threads | 49.576 (± 4.0%) i/s - 496.000 in 10.038491s |
| 64 threads | 24.060 (± 8.3%) i/s - 240.000 in 10.033481s |

(All numbers from my laptop: i7-1068NG7 + ruby 2.7.4p191 [x86_64-darwin19].)

When combined with our dynamic sampling mechanism, if the profiler starts to take longer to sample, it just samples less often. BUT this means it can impact service latency, because we may suddenly spend 40ms+ of CPU time in a single sampling pass (the 64-thread example with 500-frame stacks, a very extreme case), which from experience (#1511, #1522) is not good.

Instead, this PR changes the stack sampler to sample up to a set maximum of threads (16), randomly selected among all threads (a minimal sketch of this idea follows at the end of this comment). This means that adding more threads to the system causes very little degradation:

| Sampled threads | Iterations per second |
|-----------------|-----------------------|
| 16 threads | 104.859 (± 6.7%) i/s - 1.050k in 10.068781s |
| 32 threads | 100.957 (± 5.9%) i/s - 1.010k in 10.047250s |
| 64 threads | 98.098 (± 5.1%) i/s - 981.000 in 10.038244s |
| 256 threads | 84.740 (± 8.3%) i/s - 848.000 in 10.074037s |

There's still a bit of degradation that I suspect comes from pure VM overhead -- 256+ threads on a single Ruby VM adds up.

Because we pick the threads to sample randomly, we'll eventually sample all threads -- just not all at once. Finally, regarding the dynamic sampling mechanism: because the profiler will not take longer to sample, it will sample more often, and thus over a longer period we should end up taking roughly the same samples.

One downside of this approach is that if there really are many threads, the resulting wall clock times in a one-minute profile may "drift" around the 60-second mark. For example, we may have sampled a given thread only once per second and only 59 times, so we'll report 59s, but the next report will include the missing sample, so its result will be 61s. I've observed 60 ± 1.68 seconds for an app with ~65 threads, given the default maximum of 16 threads. This seems a reasonable enough margin of error given the improvement to latency (especially on such a large application, where trying to sample all threads would have an even bigger latency impact).

If it helps, I've used this trivial Ruby app for testing with many threads:

```ruby
32.times do |n|
  eval("
    def sleeper#{n}
      sleep
    end
  ")

  Thread.new { send(:"sleeper#{n}") }
end

sleep
```
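As promised above, here is a minimal sketch of the "pick at most N threads at random per pass" idea. The constant, method name, and selection logic are assumptions for illustration, not the sampler's actual implementation:

```ruby
# Illustrative only: cap each sampling pass at a fixed number of threads,
# chosen at random. Names and constants are assumptions for this sketch.
MAX_THREADS_SAMPLED = 16

def threads_to_sample
  threads = Thread.list

  return threads if threads.size <= MAX_THREADS_SAMPLED

  # Array#sample(n) picks n distinct threads at random, so over many passes
  # every thread still gets sampled, just not all in the same pass.
  threads.sample(MAX_THREADS_SAMPLED)
end

threads_to_sample.each do |thread|
  # ... collect a stack sample for `thread` here ...
end
```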
I've addressed the lowest of the low-hanging fruit and already got a nice improvement (see the benchmark results above).
I have a few more leads/ideas for improvements, but I'll submit them separately.