refactor fast delta profiling #1563
Conversation
This wasn't the main goal, but it seems like the big-heap.pprof bench also benefits from this a lot. Since this benchmark is the most realistic, that's very nice.

```
$ benchstat before.txt after.txt
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12      844µs ± 3%     865µs ± 1%     ~        (p=0.095 n=5+5)
FastDelta/testdata/big-heap.pprof-12  10.6ms ± 2%    10.7ms ± 1%    ~        (p=0.151 n=5+5)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12      175kB ± 0%     216kB ± 0%     +23.61%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  1.53MB ± 0%    1.54MB ± 0%    +0.76%   (p=0.008 n=5+5)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12      178 ± 0%       187 ± 0%       +5.06%   (p=0.016 n=4+5)
FastDelta/testdata/big-heap.pprof-12  1.17k ± 0%     0.50k ± 0%     -57.27%  (p=0.008 n=5+5)
```

```
$ benchstat after-3.txt after-4.txt
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12      889µs ± 1%     851µs ± 0%     -4.28%   (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  11.1ms ± 1%    10.4ms ± 0%    -5.70%   (p=0.008 n=5+5)

name                                  old speed      new speed      delta
FastDelta/testdata/heap.pprof-12      29.7MB/s ± 1%  31.0MB/s ± 0%  +4.47%   (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  28.1MB/s ± 1%  29.8MB/s ± 0%  +6.05%   (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12      216kB ± 0%     209kB ± 0%     -3.46%   (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  1.54MB ± 0%    1.44MB ± 0%    -6.59%   (p=0.008 n=5+5)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12      187 ± 0%       160 ± 0%       -14.04%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  503 ± 0%       388 ± 1%       -22.90%  (p=0.008 n=5+5)
```
Right now, a failure of this test spits the entire binary dump of the profile to stdout.
benchstat says it's not significant, but it looks like a 7% win to me and can be reproduced. We'd probably need to run this on a metal host to get more stable numbers.

```
name                              old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-10  267µs ± 0%     246µs ± 0%     ~  (p=0.200 n=1+9)

name                              old speed      new speed      delta
FastDelta/testdata/heap.pprof-10  99.0MB/s ± 0%  107.3MB/s ± 0% ~  (p=0.200 n=1+9)

name                              old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-10  532B ± 0%      530B ± 1%      ~  (p=0.364 n=1+10)

name                              old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-10  20.0 ± 0%      20.0 ± 0%      ~  (all equal)
```
still hacky and needs cleaning up ... but this is neat :)
As this is a pretty significant rewrite, another round of fidelity check testing in relenv and testing on prod instances is warranted. The main fastdelta branch has gone through all that, and we have built some confidence in it. Does this rewrite risk missing 1.44.0?
Some error messages and comments need to be cleaned up, but overall I like the refactor.
Some general observations.
- Most of the memory optimizations could have been made on the original implementation.
- One thing I like about the original implementation was the readability. Having the structure of the pprof stream un-abstracted arguably makes the code simpler and easier to comprehend as a whole.
- Abstracting the pprof parsing into pproflite might make this more re-usable. With fastdelta as the only dependent, YAGNI applies. But I think making the pprof parsing more general, as this refactor does, is a good trade-off.
Ideally, this near-complete-rewrite would have come after the review on #1511 was completed and we'd already have 1.44.0 released.
But given we've gone ahead and made bug fixes to this branch instead and effectively made this the working branch for the past week, my vote is to go ahead and get it merged.
I won't have time this week to run production tests on this branch as we have already done with feature-fast-delta-profiling. So this is what I'd like to do:
1. Get this merged into feature-fast-delta ASAP.
2. Get @nsrip-dd's fix for "profiler/internal/fastdelta: handle duplicate samples" #1571 merged. Nick and I chatted out-of-band, and my vote is to pay the cost of another hash pass so we separate aggregation and sample writing, rather than aggregating all samples in-memory and then writing them out. He has a fix in progress based on this branch's refactor, another reason to merge this ASAP. We both somehow completely missed the failed fidelity checks, mostly because it appears logs aren't being collected for the reliability env pods. I hadn't observed failure messages when searching logs, but it turns out the logs are just not there. 🤦
3. Next week, deploy feature-fast-delta-profiling onto a production shadow (again).
4. Cut 1.44 with fastdelta disabled by default.
@nsrip-dd and I discussed item 4 at length out-of-band.
Even though we might have more confidence than a week ago that we've tested thoroughly:
- the upside is low - the performance improvement is not going to be noticed by most services
- the downside of something missed is high, forcing customers to downgrade or deploy an env var during the holidays.
Given the low upside and high downside, I do not endorse releasing fastdelta with it enabled by default on the first release.
I do not think deploying on our staging services is a good test for this case, at least not without significant effort from us. Most of our highest-throughput services (metrics) are not actively tended to there, as the teams' workflow is focused only on test-in-prod techniques (shadows, canaries, etc.). Any verification and load testing in staging would be left entirely to us.
If released behind an env var flag, getting teams to opt into the new feature is an easy sell if their service is dominated by delta profile resource utilization, and an easy roll-back if something goes wrong.
profiler/internal/fastdelta/fd.go
Outdated
```go
func(f pproflite.Field) error {
	fn, ok := f.(*pproflite.Function)
	if !ok {
		return fmt.Errorf("functionPass: unexpected field: %T", f)
```
error message
Heap profiles from the Go runtime sometimes contain multiple samples with the same call stack and labels. Normally we would expect such samples to be aggregated, and in fact our implementation assumes that is the case. However, profiles with multiple samples which should have been aggregated are actually common. The original google/pprof-based delta implementation aggregated these duplicate samples, but our implementation did not. This resulted in disagreements between the two implementations. Multiple such profiles were found by running the TestRepeatedHeapProfile stress test and have been added to the test corpus, and this has also been observed in internal testing. This is possibly unintentional behavior from the runtime, but we can account for it until it is fully diagnosed and fixed upstream. This commit adds another pass to aggregate duplicate samples before the pass which diffs the previous and current sample values. This, unfortunately, eats into some of our performance gains. However, we still maintain a significant performance improvement over the original implementation. 
```
name                                              old time/op    new time/op    delta
FastDelta/testdata/heap.pprof/setup-8             934µs ± 7%     1414µs ± 5%    +51.40%   (p=0.000 n=8+9)
FastDelta/testdata/heap.pprof/steady-state-8      329µs ± 9%     605µs ±22%     +83.61%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/setup-8         10.4ms ± 3%    17.7ms ± 9%    +69.22%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8  4.03ms ± 2%    6.96ms ± 7%    +72.81%   (p=0.000 n=9+10)
MakeGolden/testdata/heap.pprof-8                  3.91ms ± 3%    3.90ms ± 4%    ~         (p=0.604 n=9+10)
MakeGolden/testdata/big-heap.pprof-8              53.2ms ±19%    45.2ms ± 2%    -15.06%   (p=0.000 n=10+9)

name                                              old speed      new speed      delta
FastDelta/testdata/heap.pprof/setup-8             28.3MB/s ± 7%  18.7MB/s ± 5%  -33.99%   (p=0.000 n=8+9)
FastDelta/testdata/heap.pprof/steady-state-8      80.3MB/s ± 9%  44.2MB/s ±19%  -44.96%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/setup-8         29.8MB/s ± 3%  17.7MB/s ± 8%  -40.79%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8  77.3MB/s ± 2%  44.8MB/s ± 6%  -42.07%   (p=0.000 n=9+10)

name                                              old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof/setup-8             151kB ± 0%     242kB ± 0%     +60.30%   (p=0.000 n=10+10)
FastDelta/testdata/heap.pprof/steady-state-8      0.00B          0.00B          ~         (all equal)
FastDelta/testdata/big-heap.pprof/setup-8         946kB ± 0%     1713kB ± 0%    +81.06%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8  33.4B ± 8%     39.2B ±13%     +17.21%   (p=0.000 n=9+10)
MakeGolden/testdata/heap.pprof-8                  2.98MB ± 0%    2.98MB ± 0%    -0.00%    (p=0.000 n=8+10)
MakeGolden/testdata/big-heap.pprof-8              35.4MB ± 0%    35.4MB ± 0%    ~         (p=0.089 n=10+10)

name                                              old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof/setup-8             138 ± 1%       139 ± 1%       +0.72%    (p=0.001 n=10+10)
FastDelta/testdata/heap.pprof/steady-state-8      0.00           0.00           ~         (all equal)
FastDelta/testdata/big-heap.pprof/setup-8         313 ± 1%       391 ± 1%       +24.86%   (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof/steady-state-8  0.00           0.00           ~         (all equal)
MakeGolden/testdata/heap.pprof-8                  41.2k ± 0%     41.2k ± 0%     ~         (all equal)
MakeGolden/testdata/big-heap.pprof-8              524k ± 0%      524k ± 0%      -0.00%    (p=0.037 n=10+10)

name                                              old heap-alloc/op  new heap-alloc/op  delta
FastDelta/testdata/heap.pprof/steady-state-8      162kB ±237%        0kB                ~  (p=0.173 n=10+9)
FastDelta/testdata/big-heap.pprof/steady-state-8  538kB ±130%        601kB ±100%        ~  (p=0.378 n=10+10)
```
```
name                                  old time/op        new time/op        delta
Delta/heap.pprof/setup-10             2.69ms ± 1%        0.92ms ± 0%        -65.88%   (p=0.000 n=10+8)
Delta/heap.pprof/steady-state-10      2.05ms ± 0%        0.38ms ± 0%        -81.53%   (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         32.1ms ± 1%        10.7ms ± 0%        -66.56%   (p=0.000 n=9+8)
Delta/big-heap.pprof/steady-state-10  26.5ms ± 2%        4.7ms ± 0%         -82.36%   (p=0.000 n=10+9)

name                                  old speed          new speed          delta
Delta/heap.pprof/setup-10             9.82MB/s ± 1%      28.78MB/s ± 0%     +193.05%  (p=0.000 n=10+8)
Delta/heap.pprof/steady-state-10      12.9MB/s ± 0%      69.6MB/s ± 0%      +441.42%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         9.68MB/s ± 1%      28.96MB/s ± 0%     +199.07%  (p=0.000 n=9+8)
Delta/big-heap.pprof/steady-state-10  11.7MB/s ± 2%      66.5MB/s ± 0%      +466.96%  (p=0.000 n=10+9)

name                                  old alloc/op       new alloc/op       delta
Delta/heap.pprof/setup-10             2.99MB ± 0%        0.24MB ± 0%        -91.94%   (p=0.000 n=10+10)
Delta/heap.pprof/steady-state-10      2.38MB ± 0%        0.00MB             -100.00%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         36.1MB ± 0%        1.7MB ± 0%         -95.26%   (p=0.000 n=10+10)
Delta/big-heap.pprof/steady-state-10  28.7MB ± 0%        0.0MB              -100.00%  (p=0.000 n=9+10)

name                                  old allocs/op      new allocs/op      delta
Delta/heap.pprof/setup-10             41.2k ± 0%         0.1k ± 0%          -99.67%   (p=0.000 n=10+10)
Delta/heap.pprof/steady-state-10      34.3k ± 0%         0.0k               -100.00%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         527k ± 0%          0k ± 0%            -99.93%   (p=0.000 n=10+9)
Delta/big-heap.pprof/steady-state-10  450k ± 0%          0k                 -100.00%  (p=0.000 n=9+10)

name                                  old heap-inuse-alloc/op  new heap-inuse-alloc/op  delta
Delta/heap.pprof/steady-state-10      406kB ± 0%               140kB ± 0%               -65.44%  (p=0.000 n=9+10)
Delta/big-heap.pprof/steady-state-10  4.58MB ± 0%              0.81MB ± 1%              -82.23%  (p=0.000 n=9+10)
```
@pmbauer thank you so much for your thoughtful review and comments on this. I failed to proactively communicate on some of my work here, but the gist is that it was very much a result of my code review. I mostly wanted to make sure I understood everything very deeply and didn't make suggestions I wasn't confident were possible. It's difficult with performance-sensitive code like this, because performance is usually at odds with abstraction and separation of concerns. Both Nick and I are aligned with you on shipping this to customers off by default and doing more battle hardening internally before changing that. But I also want to find some time to catch up in the next week or after (I'm mostly OOO next week) to do a mini retrospective on this project, share some more thoughts from my end, and listen to yours. Again, thank you so much for all your work on this. I'm looking forward to syncing up.
This PR attempts to create a better separation of concerns in the code while fixing a few bugs and improving performance along the way.