Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor fast delta profiling #1563

Conversation

felixge
Copy link
Member

@felixge felixge commented Nov 2, 2022

This PR attempts to create a better separation of concerns in the code while fixing a few bugs and improving performance along the way.

This wasn't the main goal, but it seems like the big-heap.pprof bench
also benefits from this a lot. Since this benchmark is the most
realistic, that's very nice.

benchstat before.txt after.txt
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12         844µs ± 3%     865µs ± 1%     ~     (p=0.095 n=5+5)
FastDelta/testdata/big-heap.pprof-12    10.6ms ± 2%    10.7ms ± 1%     ~     (p=0.151 n=5+5)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12         175kB ± 0%     216kB ± 0%  +23.61%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12    1.53MB ± 0%    1.54MB ± 0%   +0.76%  (p=0.008 n=5+5)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12           178 ± 0%       187 ± 0%   +5.06%  (p=0.016 n=4+5)
FastDelta/testdata/big-heap.pprof-12     1.17k ± 0%     0.50k ± 0%  -57.27%  (p=0.008 n=5+5)
benchstat after-3.txt after-4.txt
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12         889µs ± 1%     851µs ± 0%   -4.28%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12    11.1ms ± 1%    10.4ms ± 0%   -5.70%  (p=0.008 n=5+5)

name                                  old speed      new speed      delta
FastDelta/testdata/heap.pprof-12      29.7MB/s ± 1%  31.0MB/s ± 0%   +4.47%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  28.1MB/s ± 1%  29.8MB/s ± 0%   +6.05%  (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12         216kB ± 0%     209kB ± 0%   -3.46%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12    1.54MB ± 0%    1.44MB ± 0%   -6.59%  (p=0.008 n=5+5)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12           187 ± 0%       160 ± 0%  -14.04%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12       503 ± 0%       388 ± 1%  -22.90%  (p=0.008 n=5+5)
Right now a failure of this test was spitting the entire binary dump of
he profile to stdout.
@felixge felixge changed the title Felix.geisendoerfer/feature fast delta profiling 3 refactor fast delta profiling Nov 3, 2022
benchstat says its not significant, but it looks like a 7% win to me and
can be reproduced. Probably would need to run this on a metal host to
get more stable numbers.

name                              old time/op    new time/op     delta
FastDelta/testdata/heap.pprof-10     267µs ± 0%      246µs ± 0%   ~     (p=0.200 n=1+9)

name                              old speed      new speed       delta
FastDelta/testdata/heap.pprof-10  99.0MB/s ± 0%  107.3MB/s ± 0%   ~     (p=0.200 n=1+9)

name                              old alloc/op   new alloc/op    delta
FastDelta/testdata/heap.pprof-10      532B ± 0%       530B ± 1%   ~     (p=0.364 n=1+10)

name                              old allocs/op  new allocs/op   delta
FastDelta/testdata/heap.pprof-10      20.0 ± 0%       20.0 ± 0%   ~     (all equal)
still hacky and needs cleaning up ... but this is neat :)
@pmbauer
Copy link
Member

pmbauer commented Nov 6, 2022

As this is a pretty significant rewrite, another round of fidelity check testing in relenv and testing on prod instances is warranted. The main fastdelta branch has gone through all that and we have built some confidence in it. Does this rewrite risk missing 1.44.0?

@nsrip-dd nsrip-dd marked this pull request as ready for review November 7, 2022 23:33
@nsrip-dd nsrip-dd requested review from a team as code owners November 7, 2022 23:33
Copy link
Member

@pmbauer pmbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some error messages and comments need cleaned up, but overall I like the refactor.

Some general observations.

  • Most of the memory optimizations could have been made on the original implementation.
  • One thing I like about the original implementation was the readability. Having the structure of the pprof stream un-abstracted arguably makes the code simpler and easier to comprehend as a whole.
  • Abstracting the pprof parsing into pproflite might make this more re-usable. With fastdelta as the only dependent, YAGNI applies. But I think making the pprof parsing more general as this refactor does is a good trade-off.

Ideally, this near-complete-rewrite would have come after the review on #1511 was completed and we'd already have 1.44.0 released.

But given we've gone ahead and made bug fixes to this branch instead and effectively made this the working branch for the past week, my vote is to go ahead and get it merged.

I won't have time this week to run some production tests on this branch as we have already done with feature-fast-delta-profiling. So this is what I'd like to do:

  1. get this merged into feature-fast-delta ASAP
  2. get @nsrip-dd 's fix for profiler/internal/fastdelta: handle duplicate samples #1571 merged. Nick and I chatted out-of-band, and my vote is to pay the cost of another hash pass so we separate aggregation and sample writing. This rather than aggregating all samples in-memory then writing them out. He has a fix in progress based on this branch's refactor, another reason to merge this ASAP. We both somehow completely missed the failed fidelity checks, mostly because it appears logs aren't being collected for the reliability env pods. I'd not observed failure messages when searching logs, but it turns out the logs are just not there. 🤦
  3. next week, deploy feature-fast-delta-profiling onto a production shadow (again).
  4. cut 1.44 with fastdelta disabled by default.

@nsrip-dd and I discussed 4 at length out-of-band.
Even though we might have more confidence than a week ago that we've tested throughly:

  • the upside is low - the performance improvement is not going to be noticed by most services
  • the downside of something missed is high, forcing customers to downgrade or deploy an env var during the holidays.

Given the low upside and high downside, I do not endorse releasing fastdelta with it enabled by default on the first release.

I do not think deploying on our staging services is a good test for this case, at least not without significant effort from us. Most of our highest-throughput services (metrics) are not actively tended to there as the teams' workflow is focused only test-in-prod techniques (shadows, canaries, etc). Any verification and load testing in staging would be left entirely to us.

If released behind an env var flag, getting teams to opt into the new feature is an easy sell if their service is dominated by delta profile resource utilization and an easy roll-back if something goes wrong.

profiler/internal/fastdelta/fd.go Outdated Show resolved Hide resolved
profiler/internal/fastdelta/fd.go Outdated Show resolved Hide resolved
profiler/internal/fastdelta/fd.go Outdated Show resolved Hide resolved
profiler/internal/fastdelta/fd.go Outdated Show resolved Hide resolved
func(f pproflite.Field) error {
fn, ok := f.(*pproflite.Function)
if !ok {
return fmt.Errorf("functionPass: unexpected field: %T", f)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error message

profiler/internal/fastdelta/location_index.go Show resolved Hide resolved
profiler/internal/pproflite/decoder.go Show resolved Hide resolved
nsrip-dd and others added 8 commits November 10, 2022 10:50
Heap profiles from the Go runtime sometimes contain multiple samples with the same call stack and labels. Normally we would expect such samples to be aggregated, and in fact our implementation assumes that is the case. However, profiles with multiple samples which should have been aggregated are actually common. The original google/pprof-based delta implementation aggregated these duplicate samples, but our implementation did not. This resulted in disagreements between the two implementations. Multiple such profiles were found by running the TestRepeatedHeapProfile stress test and have been added to the test corpus, and this has also been observed in internal testing. This is possibly unintentional behavior from the runtime, but we can account for it until it is fully diagnosed and fixed upstream.

This commit adds another pass to aggregate duplicate samples before the pass which diffs the previous and current sample values. This, unfortunately, eats into some of our performance gains. However, we still maintain a significant performance improvement over the original implementation.

name                                              old time/op        new time/op        delta
FastDelta/testdata/heap.pprof/setup-8                    934µs ± 7%        1414µs ± 5%  +51.40%  (p=0.000 n=8+9)
FastDelta/testdata/heap.pprof/steady-state-8             329µs ± 9%         605µs ±22%  +83.61%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/setup-8               10.4ms ± 3%        17.7ms ± 9%  +69.22%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8        4.03ms ± 2%        6.96ms ± 7%  +72.81%  (p=0.000 n=9+10)
MakeGolden/testdata/heap.pprof-8                        3.91ms ± 3%        3.90ms ± 4%     ~     (p=0.604 n=9+10)
MakeGolden/testdata/big-heap.pprof-8                    53.2ms ±19%        45.2ms ± 2%  -15.06%  (p=0.000 n=10+9)

name                                              old speed          new speed          delta
FastDelta/testdata/heap.pprof/setup-8                 28.3MB/s ± 7%      18.7MB/s ± 5%  -33.99%  (p=0.000 n=8+9)
FastDelta/testdata/heap.pprof/steady-state-8          80.3MB/s ± 9%      44.2MB/s ±19%  -44.96%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/setup-8             29.8MB/s ± 3%      17.7MB/s ± 8%  -40.79%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8      77.3MB/s ± 2%      44.8MB/s ± 6%  -42.07%  (p=0.000 n=9+10)

name                                              old alloc/op       new alloc/op       delta
FastDelta/testdata/heap.pprof/setup-8                    151kB ± 0%         242kB ± 0%  +60.30%  (p=0.000 n=10+10)
FastDelta/testdata/heap.pprof/steady-state-8             0.00B              0.00B          ~     (all equal)
FastDelta/testdata/big-heap.pprof/setup-8                946kB ± 0%        1713kB ± 0%  +81.06%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8         33.4B ± 8%         39.2B ±13%  +17.21%  (p=0.000 n=9+10)
MakeGolden/testdata/heap.pprof-8                        2.98MB ± 0%        2.98MB ± 0%   -0.00%  (p=0.000 n=8+10)
MakeGolden/testdata/big-heap.pprof-8                    35.4MB ± 0%        35.4MB ± 0%     ~     (p=0.089 n=10+10)

name                                              old allocs/op      new allocs/op      delta
FastDelta/testdata/heap.pprof/setup-8                      138 ± 1%           139 ± 1%   +0.72%  (p=0.001 n=10+10)
FastDelta/testdata/heap.pprof/steady-state-8              0.00               0.00          ~     (all equal)
FastDelta/testdata/big-heap.pprof/setup-8                  313 ± 1%           391 ± 1%  +24.86%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8          0.00               0.00          ~     (all equal)
MakeGolden/testdata/heap.pprof-8                         41.2k ± 0%         41.2k ± 0%     ~     (all equal)
MakeGolden/testdata/big-heap.pprof-8                      524k ± 0%          524k ± 0%   -0.00%  (p=0.037 n=10+10)

name                                              old heap-alloc/op  new heap-alloc/op  delta
FastDelta/testdata/heap.pprof/steady-state-8            162kB ±237%           0kB          ~     (p=0.173 n=10+9)
FastDelta/testdata/big-heap.pprof/steady-state-8        538kB ±130%        601kB ±100%     ~     (p=0.378 n=10+10)
name                                  old time/op    new time/op     delta
Delta/heap.pprof/setup-10               2.69ms ± 1%     0.92ms ± 0%   -65.88%  (p=0.000 n=10+8)
Delta/heap.pprof/steady-state-10        2.05ms ± 0%     0.38ms ± 0%   -81.53%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10           32.1ms ± 1%     10.7ms ± 0%   -66.56%  (p=0.000 n=9+8)
Delta/big-heap.pprof/steady-state-10    26.5ms ± 2%      4.7ms ± 0%   -82.36%  (p=0.000 n=10+9)

name                                  old speed      new speed       delta
Delta/heap.pprof/setup-10             9.82MB/s ± 1%  28.78MB/s ± 0%  +193.05%  (p=0.000 n=10+8)
Delta/heap.pprof/steady-state-10      12.9MB/s ± 0%   69.6MB/s ± 0%  +441.42%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         9.68MB/s ± 1%  28.96MB/s ± 0%  +199.07%  (p=0.000 n=9+8)
Delta/big-heap.pprof/steady-state-10  11.7MB/s ± 2%   66.5MB/s ± 0%  +466.96%  (p=0.000 n=10+9)
name                                  old alloc/op             new alloc/op             delta
Delta/heap.pprof/setup-10                         2.99MB ± 0%              0.24MB ± 0%   -91.94%  (p=0.000 n=10+10)
Delta/heap.pprof/steady-state-10                  2.38MB ± 0%              0.00MB       -100.00%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10                     36.1MB ± 0%               1.7MB ± 0%   -95.26%  (p=0.000 n=10+10)
Delta/big-heap.pprof/steady-state-10              28.7MB ± 0%               0.0MB       -100.00%  (p=0.000 n=9+10)

name                                  old allocs/op            new allocs/op            delta
Delta/heap.pprof/setup-10                          41.2k ± 0%                0.1k ± 0%   -99.67%  (p=0.000 n=10+10)
Delta/heap.pprof/steady-state-10                   34.3k ± 0%                0.0k       -100.00%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10                       527k ± 0%                  0k ± 0%   -99.93%  (p=0.000 n=10+9)
Delta/big-heap.pprof/steady-state-10                450k ± 0%                  0k       -100.00%  (p=0.000 n=9+10)

name                                  old heap-inuse-alloc/op  new heap-inuse-alloc/op  delta
Delta/heap.pprof/steady-state-10                   406kB ± 0%               140kB ± 0%   -65.44%  (p=0.000 n=9+10)
Delta/big-heap.pprof/steady-state-10              4.58MB ± 0%              0.81MB ± 1%   -82.23%  (p=0.000 n=9+10)
@felixge felixge merged commit ebd6ea5 into feature-fast-delta-profiling Nov 11, 2022
@felixge felixge deleted the felix.geisendoerfer/feature-fast-delta-profiling-3 branch November 11, 2022 22:14
@felixge
Copy link
Member Author

felixge commented Nov 12, 2022

@pmbauer thank you so much for your thoughtful review and comments on this. I failed to proactively communicate on some of my work here, but the gist is that it was very much a result of my code review. I mostly wanted to make sure I understand everything very deeply and not make suggestions that I wasn't confident in being possible. It's difficult with performance sensitive code like this, because it usually is at odds with abstraction and separation of concerns.

Both Nick and I are aligned with you on shipping this to customers off by default and doing more battle hardening internally before changing that. But I also want to find some time to catch up in the next week or after (I'm mostly OOO next week) so do a mini retrospective on this project and share some more thoughts from my end and listen to yours.

Again, thank you so much for all your work on this. I'm looking forward to syncing up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants