refactor fast delta profiling #1563
Conversation
This wasn't the main goal, but it seems like the big-heap.pprof bench also benefits from this a lot. Since this benchmark is the most realistic, that's very nice.

```
$ benchstat before.txt after.txt
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12      844µs ± 3%     865µs ± 1%     ~        (p=0.095 n=5+5)
FastDelta/testdata/big-heap.pprof-12  10.6ms ± 2%    10.7ms ± 1%    ~        (p=0.151 n=5+5)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12      175kB ± 0%     216kB ± 0%     +23.61%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  1.53MB ± 0%    1.54MB ± 0%    +0.76%   (p=0.008 n=5+5)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12      178 ± 0%       187 ± 0%       +5.06%   (p=0.016 n=4+5)
FastDelta/testdata/big-heap.pprof-12  1.17k ± 0%     0.50k ± 0%     -57.27%  (p=0.008 n=5+5)
```

```
$ benchstat after-3.txt after-4.txt
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12      889µs ± 1%     851µs ± 0%     -4.28%   (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  11.1ms ± 1%    10.4ms ± 0%    -5.70%   (p=0.008 n=5+5)

name                                  old speed      new speed      delta
FastDelta/testdata/heap.pprof-12      29.7MB/s ± 1%  31.0MB/s ± 0%  +4.47%   (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  28.1MB/s ± 1%  29.8MB/s ± 0%  +6.05%   (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12      216kB ± 0%     209kB ± 0%     -3.46%   (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  1.54MB ± 0%    1.44MB ± 0%    -6.59%   (p=0.008 n=5+5)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12      187 ± 0%       160 ± 0%       -14.04%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  503 ± 0%       388 ± 1%       -22.90%  (p=0.008 n=5+5)
```
Right now, a failure of this test spits the entire binary dump of the profile to stdout.
benchstat says it's not significant, but it looks like a 7% win to me and can be reproduced. We'd probably need to run this on a metal host to get more stable numbers.

```
name                              old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-10  267µs ± 0%     246µs ± 0%     ~  (p=0.200 n=1+9)

name                              old speed      new speed      delta
FastDelta/testdata/heap.pprof-10  99.0MB/s ± 0%  107.3MB/s ± 0% ~  (p=0.200 n=1+9)

name                              old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-10  532B ± 0%      530B ± 1%      ~  (p=0.364 n=1+10)

name                              old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-10  20.0 ± 0%      20.0 ± 0%      ~  (all equal)
```
still hacky and needs cleaning up ... but this is neat :)
As this is a pretty significant rewrite, another round of fidelity check testing in relenv and testing on prod instances is warranted. The main fastdelta branch has gone through all that, and we have built some confidence in it. Does this rewrite risk missing 1.44.0?
Some error messages and comments need to be cleaned up, but overall I like the refactor.
Some general observations.
- Most of the memory optimizations could have been made on the original implementation.
- One thing I like about the original implementation was the readability. Having the structure of the pprof stream un-abstracted arguably makes the code simpler and easier to comprehend as a whole.
- Abstracting the pprof parsing into pproflite might make this more re-usable. With fastdelta as the only dependent, YAGNI applies. But I think making the pprof parsing more general, as this refactor does, is a good trade-off.
Ideally, this near-complete-rewrite would have come after the review on #1511 was completed and we'd already have 1.44.0 released.
But given we've gone ahead and made bug fixes to this branch instead and effectively made this the working branch for the past week, my vote is to go ahead and get it merged.
I won't have time this week to run production tests on this branch as we have already done with feature-fast-delta-profiling. So this is what I'd like to do:
1. Get this merged into feature-fast-delta ASAP.
2. Get @nsrip-dd's fix for "profiler/internal/fastdelta: handle duplicate samples" #1571 merged. Nick and I chatted out-of-band, and my vote is to pay the cost of another hash pass so we separate aggregation and sample writing, rather than aggregating all samples in-memory and then writing them out. He has a fix in progress based on this branch's refactor, another reason to merge this ASAP. We both somehow completely missed the failed fidelity checks, mostly because it appears logs aren't being collected for the reliability env pods. I hadn't observed failure messages when searching logs, but it turns out the logs are just not there. 🤦
3. Next week, deploy feature-fast-delta-profiling onto a production shadow (again).
4. Cut 1.44 with fastdelta disabled by default.
@nsrip-dd and I discussed item 4 at length out-of-band.
Even though we might have more confidence than a week ago that we've tested thoroughly:
- the upside is low - the performance improvement is not going to be noticed by most services
- the downside of something missed is high, forcing customers to downgrade or deploy an env var during the holidays.
Given the low upside and high downside, I do not endorse releasing fastdelta with it enabled by default on the first release.
I do not think deploying on our staging services is a good test for this case, at least not without significant effort from us. Most of our highest-throughput services (metrics) are not actively tended to there, as the teams' workflow is focused only on test-in-prod techniques (shadows, canaries, etc.). Any verification and load testing in staging would be left entirely to us.
If released behind an env var flag, getting teams to opt into the new feature is an easy sell if their service is dominated by delta profile resource utilization, and an easy roll-back if something goes wrong.
profiler/internal/fastdelta/fd.go
Outdated
```go
func(f pproflite.Field) error {
	fn, ok := f.(*pproflite.Function)
	if !ok {
		return fmt.Errorf("functionPass: unexpected field: %T", f)
```
error message
Heap profiles from the Go runtime sometimes contain multiple samples with the same call stack and labels. Normally we would expect such samples to be aggregated, and in fact our implementation assumes that is the case. However, profiles with multiple samples which should have been aggregated are actually common. The original google/pprof-based delta implementation aggregated these duplicate samples, but our implementation did not. This resulted in disagreements between the two implementations. Multiple such profiles were found by running the TestRepeatedHeapProfile stress test and have been added to the test corpus, and this has also been observed in internal testing. This is possibly unintentional behavior from the runtime, but we can account for it until it is fully diagnosed and fixed upstream. This commit adds another pass to aggregate duplicate samples before the pass which diffs the previous and current sample values. This, unfortunately, eats into some of our performance gains. However, we still maintain a significant performance improvement over the original implementation. 
```
name                                              old time/op    new time/op    delta
FastDelta/testdata/heap.pprof/setup-8             934µs ± 7%     1414µs ± 5%    +51.40%   (p=0.000 n=8+9)
FastDelta/testdata/heap.pprof/steady-state-8      329µs ± 9%     605µs ±22%     +83.61%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/setup-8         10.4ms ± 3%    17.7ms ± 9%    +69.22%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8  4.03ms ± 2%    6.96ms ± 7%    +72.81%   (p=0.000 n=9+10)
MakeGolden/testdata/heap.pprof-8                  3.91ms ± 3%    3.90ms ± 4%    ~         (p=0.604 n=9+10)
MakeGolden/testdata/big-heap.pprof-8              53.2ms ±19%    45.2ms ± 2%    -15.06%   (p=0.000 n=10+9)

name                                              old speed      new speed      delta
FastDelta/testdata/heap.pprof/setup-8             28.3MB/s ± 7%  18.7MB/s ± 5%  -33.99%   (p=0.000 n=8+9)
FastDelta/testdata/heap.pprof/steady-state-8      80.3MB/s ± 9%  44.2MB/s ±19%  -44.96%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/setup-8         29.8MB/s ± 3%  17.7MB/s ± 8%  -40.79%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8  77.3MB/s ± 2%  44.8MB/s ± 6%  -42.07%   (p=0.000 n=9+10)

name                                              old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof/setup-8             151kB ± 0%     242kB ± 0%     +60.30%   (p=0.000 n=10+10)
FastDelta/testdata/heap.pprof/steady-state-8      0.00B          0.00B          ~         (all equal)
FastDelta/testdata/big-heap.pprof/setup-8         946kB ± 0%     1713kB ± 0%    +81.06%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8  33.4B ± 8%     39.2B ±13%     +17.21%   (p=0.000 n=9+10)
MakeGolden/testdata/heap.pprof-8                  2.98MB ± 0%    2.98MB ± 0%    -0.00%    (p=0.000 n=8+10)
MakeGolden/testdata/big-heap.pprof-8              35.4MB ± 0%    35.4MB ± 0%    ~         (p=0.089 n=10+10)

name                                              old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof/setup-8             138 ± 1%       139 ± 1%       +0.72%    (p=0.001 n=10+10)
FastDelta/testdata/heap.pprof/steady-state-8      0.00           0.00           ~         (all equal)
FastDelta/testdata/big-heap.pprof/setup-8         313 ± 1%       391 ± 1%       +24.86%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8  0.00           0.00           ~         (all equal)
MakeGolden/testdata/heap.pprof-8                  41.2k ± 0%     41.2k ± 0%     ~         (all equal)
MakeGolden/testdata/big-heap.pprof-8              524k ± 0%      524k ± 0%      -0.00%    (p=0.037 n=10+10)

name                                              old heap-alloc/op  new heap-alloc/op  delta
FastDelta/testdata/heap.pprof/steady-state-8      162kB ±237%        0kB                ~  (p=0.173 n=10+9)
FastDelta/testdata/big-heap.pprof/steady-state-8  538kB ±130%        601kB ±100%        ~  (p=0.378 n=10+10)
```
```
name                                  old time/op        new time/op        delta
Delta/heap.pprof/setup-10             2.69ms ± 1%        0.92ms ± 0%        -65.88%   (p=0.000 n=10+8)
Delta/heap.pprof/steady-state-10      2.05ms ± 0%        0.38ms ± 0%        -81.53%   (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         32.1ms ± 1%        10.7ms ± 0%        -66.56%   (p=0.000 n=9+8)
Delta/big-heap.pprof/steady-state-10  26.5ms ± 2%        4.7ms ± 0%         -82.36%   (p=0.000 n=10+9)

name                                  old speed          new speed          delta
Delta/heap.pprof/setup-10             9.82MB/s ± 1%      28.78MB/s ± 0%     +193.05%  (p=0.000 n=10+8)
Delta/heap.pprof/steady-state-10      12.9MB/s ± 0%      69.6MB/s ± 0%      +441.42%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         9.68MB/s ± 1%      28.96MB/s ± 0%     +199.07%  (p=0.000 n=9+8)
Delta/big-heap.pprof/steady-state-10  11.7MB/s ± 2%      66.5MB/s ± 0%      +466.96%  (p=0.000 n=10+9)

name                                  old alloc/op       new alloc/op       delta
Delta/heap.pprof/setup-10             2.99MB ± 0%        0.24MB ± 0%        -91.94%   (p=0.000 n=10+10)
Delta/heap.pprof/steady-state-10      2.38MB ± 0%        0.00MB             -100.00%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         36.1MB ± 0%        1.7MB ± 0%         -95.26%   (p=0.000 n=10+10)
Delta/big-heap.pprof/steady-state-10  28.7MB ± 0%        0.0MB              -100.00%  (p=0.000 n=9+10)

name                                  old allocs/op      new allocs/op      delta
Delta/heap.pprof/setup-10             41.2k ± 0%         0.1k ± 0%          -99.67%   (p=0.000 n=10+10)
Delta/heap.pprof/steady-state-10      34.3k ± 0%         0.0k               -100.00%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         527k ± 0%          0k ± 0%            -99.93%   (p=0.000 n=10+9)
Delta/big-heap.pprof/steady-state-10  450k ± 0%          0k                 -100.00%  (p=0.000 n=9+10)

name                                  old heap-inuse-alloc/op  new heap-inuse-alloc/op  delta
Delta/heap.pprof/steady-state-10      406kB ± 0%               140kB ± 0%               -65.44%  (p=0.000 n=9+10)
Delta/big-heap.pprof/steady-state-10  4.58MB ± 0%              0.81MB ± 1%              -82.23%  (p=0.000 n=9+10)
```
@pmbauer thank you so much for your thoughtful review and comments on this. I failed to proactively communicate on some of my work here, but the gist is that it was very much a result of my code review. I mostly wanted to make sure I understood everything very deeply and didn't make suggestions I wasn't confident were possible. It's difficult with performance-sensitive code like this, because performance is usually at odds with abstraction and separation of concerns. Both Nick and I are aligned with you on shipping this to customers off by default and doing more battle hardening internally before changing that. But I also want to find some time to catch up in the next week or after (I'm mostly OOO next week) to do a mini retrospective on this project, share some more thoughts from my end, and listen to yours. Again, thank you so much for all your work on this. I'm looking forward to syncing up.
This PR attempts to create a better separation of concerns in the code while fixing a few bugs and improving performance along the way.