Fast Delta Profiling #1511
Conversation
- vendor fastdelta library from DataDog/fastdelta@358911a
- anonymized big-heap.pprof, updated corresponding tests

Co-authored-by: Nick Ripley <nick.ripley@datadoghq.com>
- controls which delta method the profiler uses
- defaults to pprofile (pprofile.Scale + pprofile.Merge)
- other options: fastdelta, comparing
- added test coverage for all options
somehow lost goimports-on-save in my local dev; fixed
```
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12      924µs ± 2%     827µs ± 3%     -10.53%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  12.2ms ± 4%    10.9ms ± 4%    -10.29%  (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12      289kB ± 0%     226kB ± 0%     -22.00%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  2.44MB ± 0%    1.71MB ± 0%    -29.77%  (p=0.008 n=5+5)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12      3.33k ± 0%     1.34k ± 0%     -59.66%  (p=0.008 n=5+5)
FastDelta/testdata/big-heap.pprof-12  37.2k ± 0%     14.5k ± 0%     -61.10%  (p=0.008 n=5+5)
```
Thanks Paul!
- There is one breaking change around the StatsdClient interface that definitely needs to be addressed. I'm wondering if we should keep the timing stuff, or just the old/new consistency check (so we can still run the implementations side-by-side just to make sure they agree content-wise).
- I've also left some comments more related to the way the code is structured. If those suggestions are too much for this PR, especially related to the nesting of the different pass functions, we can address them in a subsequent PR.
```go
// f builds a profile with a single sample which has labels in the given
// order. We build the profile ourselves because we can control the
// precise binary encoding of the profile.
f := func(labels ...string) []byte {
```
I realize I wrote this test, but I'm wondering if there's a nicer way to test the hashing consistency than hand-writing a pprof to get the encoding we want? We could perhaps construct a minimal DeltaComputer with enough state populated to just give some inputs to readSample, and test that directly. On the other hand, I kind of like this more black-box style test that only calls DeltaComputer's "public" methods... At the very least, I'll leave this comment here as a breadcrumb; maybe we come back to this in a subsequent PR.
- this was over-fitting for benchmarks
Add minimal fuzz testing, and address several issues surfaced by the fuzzer:

- Bound check string table indices
- Bound check when accessing the list of included strings
- Bound check the location index
- Check that packed varints are valid, and not too large
- Check the number of values per sample

Any of these issues could have been identified _without_ the fuzzer by thinking about bounds for every access, but the fuzzer surfaced all of them quickly and we now have repeatable bad inputs to test against. The motivation for this is to have some assurance that the fast delta computer won't fail dramatically if the underlying input changes in some unexpected way, or is invalid due to a Go runtime bug.

Also, add a failsafe for crashes/errors to reset the state. If delta profile computation fails, then the DeltaComputer's internal state will be in some indeterminate form. Subsequent calls to Delta given the bad state will return junk. The DeltaComputer should basically be reset to a fresh start.
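The failsafe described above can be sketched as follows. This is a toy stand-in, not the real fastdelta code: the actual DeltaComputer parses pprof bytes, while this one just diffs per-key counters, and all names here are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// deltaComputer is a toy stand-in for fastdelta.DeltaComputer, used only
// to illustrate the reset-on-failure pattern.
type deltaComputer struct {
	prev map[string]int64 // previous profile's cumulative values
}

func (d *deltaComputer) reset() {
	d.prev = make(map[string]int64)
}

// delta returns cur minus the previous input. On any error (or internal
// panic) it resets the computer, so indeterminate state is never reused.
func (d *deltaComputer) delta(cur map[string]int64) (out map[string]int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("internal panic: %v", r)
		}
		if err != nil {
			d.reset() // failsafe: start fresh after any failure
		}
	}()
	if cur == nil {
		return nil, errors.New("invalid input")
	}
	out = make(map[string]int64, len(cur))
	for k, v := range cur {
		out[k] = v - d.prev[k]
	}
	d.prev = cur
	return out, nil
}

func main() {
	dc := &deltaComputer{prev: map[string]int64{}}
	first, _ := dc.delta(map[string]int64{"alloc_objects": 10})
	second, _ := dc.delta(map[string]int64{"alloc_objects": 25})
	fmt.Println(first["alloc_objects"], second["alloc_objects"]) // 10 15
	if _, err := dc.delta(nil); err != nil {
		fmt.Println("state reset:", len(dc.prev) == 0) // state reset: true
	}
}
```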
Nope, all good for me! IMO these changes are OK to merge as-is, especially since this is off by default behind an environment variable. I'll check in with Felix tomorrow.
I was hoping to completely review this today, but it's a lot of code 😅. That being said, I made some good progress. Most of my suggestions can be applied by merging this PR: #1550
```diff
@@ -208,6 +211,7 @@ func defaultConfig() (*config, error) {
 		uploadTimeout:     DefaultUploadTimeout,
 		maxGoroutinesWait: 1000, // arbitrary value, should limit STW to ~30ms
 		deltaProfiles:     internal.BoolEnv("DD_PROFILING_DELTA", true),
+		deltaMethod:       os.Getenv("DD_PROFILING_DELTA_METHOD"),
```
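The config value above could then gate which implementation runs. The method names (`pprofile`, `fastdelta`, `comparing`) come from this PR's commit messages, but the dispatch function below is a hypothetical sketch, not the actual dd-trace-go code:

```go
package main

import (
	"fmt"
	"os"
)

// newDeltaProfiler sketches how DD_PROFILING_DELTA_METHOD could select a
// delta implementation; it returns a name here instead of a real profiler.
func newDeltaProfiler(method string) string {
	switch method {
	case "fastdelta":
		return "fastdelta" // streaming parser, low allocations
	case "comparing":
		return "comparing" // run both implementations, check they agree
	default:
		return "pprofile" // old default: pprofile.Scale + pprofile.Merge
	}
}

func main() {
	fmt.Println(newDeltaProfiler(os.Getenv("DD_PROFILING_DELTA_METHOD")))
}
```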
TODO: Assuming we're ready to truly release this, change this to default to "fastdelta" before merging.
In Paris, we agreed to release this behind a flag, disabled by default. I still think we should release this in stages.

- 1.44.0: include both implementations, old algorithm as default, new algorithm enabled with an env var. This is so we can continue testing internally. dd-go apps are not going to go to prod with a non-release dd-trace-go. Customers that need it can opt-in.
- 1.4?.0: still include both algorithms, make fastdelta the default, enable old algorithm with an env var fallback.
```go
		return nil, fmt.Errorf("error flushing gzip writer: %v", err)
	}
	return fdp.buf.Bytes(), nil
}
```
NIT: Why is the gzip stuff handled here rather than inside the fastdelta.DeltaComputer implementation? The old implementation also dealt with it inside of pprofile.ParseData() and deltaProf.Write(), so it's a bit odd to operate at a different level of abstraction here.
This was something I asked for when fastdelta was its own library.
Compression is an orthogonal concern to the process of computing deltas.
Ok, I don't feel strongly about it.
```go
// insert associates the given address, mapping ID, and function IDs with the
// given location ID
func (l *locationIndex) insert(id, address, mappingID uint64, functionIDs []uint64) {
```
NIT: It'd be nice to make this method public in order to indicate that, unlike e.g. insertFast and insertSlow, this method is supposed to be used by outside callers. The same applies to the reset, get and getMeta methods of this struct.

I'd say it's generally a good pattern for struct methods (and fields) to be lowercase if they are only used by the struct itself, and uppercase if they are used in other places of the code. Of course this has no meaning to the compiler inside the same package, but it helps human readers and also simplifies the process of moving code to different packages over time.
```diff
-func (l *locationIndex) insert(id, address, mappingID uint64, functionIDs []uint64) {
+func (l *locationIndex) Insert(id, address, mappingID uint64, functionIDs []uint64) {
```
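The convention being suggested can be shown with a minimal sketch (the fields and helper bodies below are invented for illustration, not the real locationIndex):

```go
package main

import "fmt"

// locationIndex sketches the naming convention discussed above: methods
// intended for callers are capitalized (Insert, Get), internal helpers are
// not (insertFast). Inside one package this is purely a signal to readers.
type locationIndex struct {
	addr map[uint64]uint64 // location ID -> address
}

// Insert is part of the intended interface of the type.
func (l *locationIndex) Insert(id, address uint64) {
	l.insertFast(id, address)
}

// insertFast is an implementation detail.
func (l *locationIndex) insertFast(id, address uint64) {
	if l.addr == nil {
		l.addr = make(map[uint64]uint64)
	}
	l.addr[id] = address
}

// Get is part of the intended interface of the type.
func (l *locationIndex) Get(id uint64) (uint64, bool) {
	a, ok := l.addr[id]
	return a, ok
}

func main() {
	var idx locationIndex
	idx.Insert(1, 0xabcd)
	a, ok := idx.Get(1)
	fmt.Println(a == 0xabcd, ok) // true true
}
```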
It used to live in fd.go and I don't remember why it was moved out; @nsrip-dd, do you remember? I don't think we'd ever expose this to outside callers.
I moved it to another file because I felt it was a somewhat trickier component that I'd want to look at on its own. As for exposing it to outside callers, Felix's point is that having the methods we actually intend to use capitalized is a reminder for us of what the intended interface is for this type. If this did live in another package, Insert, Get, and GetMeta would be the methods we actually expose. The rest would be implementation details. That's also reflected in the current tests, which only access locationIndex through those methods.
SGTM; thanks
Co-authored-by: Felix Geisendörfer <felix@datadoghq.com>
1. NIT: use bytes.HasPrefix
2. NIT: Uppercase public locationIndex methods
3. location_index: simplify and -15% allocs

```
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12      921µs ± 0%     907µs ± 1%     -1.52%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof-12  11.8ms ± 1%    11.6ms ± 1%    -1.73%  (p=0.000 n=10+8)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12      201kB ± 0%     199kB ± 0%     -0.99%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof-12  1.77MB ± 0%    1.80MB ± 0%    +1.93%  (p=0.000 n=10+10)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12      1.39k ± 0%     1.17k ± 0%     -16.15%  (p=0.002 n=8+10)
FastDelta/testdata/big-heap.pprof-12  14.6k ± 0%     12.5k ± 0%     -14.09%  (p=0.000 n=10+10)
```
Might have found a bug, PTAL.
profiler/internal/fastdelta/fd.go (outdated)
```go
	}
	return true, nil
})
sort.Sort(byHash(dc.hashes))
```
🐞 dc.hashes contains hashes of labels and locations. By sorting it, samples with the same set of location IDs in different orders will produce the same final hash. I think that's wrong.
That makes sense to me. My intention for sorting was to consistently order the labels, but you're right that the call stack shouldn't be touched. I'll push a fix
Also, FWIW, technically with more than 2 locations, the location IDs are packed and we hash them in one go here: https://github.com/DataDog/dd-trace-go/pull/1511/files/2b52f5429d013cd2a03f37d58ecf42d78fc21fbd#diff-5a42c40b2af7c9a2c14a5f80489bffbe90130cfae4795cfc311c2fa768fee0e5R615

So I think this should only affect samples with exactly 2 locations.
See #1558
Order matters for hashing the addresses in a call stack. If the sample has repeated single-value location IDs (rather than packed ones), we currently hash them one at a time, append each hash to the hashes slice, and then sort it. But that discards the order in which the addresses appear. To be fully robust, we should collect all the addresses first, and only hash them after we've visited every location.
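The order-sensitive approach can be sketched as follows: feed the addresses into a single hash in stack order, so the same addresses in a different order produce a different digest. FNV here is only a stand-in for whatever hash the real implementation uses:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// hashStack hashes a call stack's addresses in order with one hash,
// so reordered stacks hash differently. This is an illustrative sketch,
// not the fastdelta implementation.
func hashStack(addrs []uint64) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	for _, a := range addrs {
		binary.LittleEndian.PutUint64(buf[:], a)
		h.Write(buf[:])
	}
	return h.Sum64()
}

func main() {
	ab := hashStack([]uint64{0x1, 0x2})
	ba := hashStack([]uint64{0x2, 0x1})
	fmt.Println(ab != ba) // order-sensitive: true
}
```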
fastdelta: reduce allocs by up to 90%

Remove allocation from sort.Sort call (big win) and also only call it if needed (small win). Also simplify the hashing a bit: no need to hash the location list and then re-hash it again, it can all go into the same hash right away.

```
$ benchstat before.txt after.txt
name                                  old time/op    new time/op    delta
FastDelta/testdata/heap.pprof-12      923µs ± 1%     833µs ± 1%     -9.75%  (p=0.000 n=9+10)
FastDelta/testdata/big-heap.pprof-12  11.6ms ± 1%    10.5ms ± 2%    -9.32%  (p=0.000 n=9+9)

name                                  old alloc/op   new alloc/op   delta
FastDelta/testdata/heap.pprof-12      199kB ± 0%     175kB ± 0%     -12.02%  (p=0.000 n=10+10)
FastDelta/testdata/big-heap.pprof-12  1.80MB ± 0%    1.53MB ± 0%    -15.11%  (p=0.000 n=10+10)

name                                  old allocs/op  new allocs/op  delta
FastDelta/testdata/heap.pprof-12      1.17k ± 0%     0.18k ± 0%     -84.86%  (p=0.000 n=8+10)
FastDelta/testdata/big-heap.pprof-12  12.5k ± 0%     1.2k ± 0%      -90.61%  (p=0.000 n=10+10)
```

Co-authored-by: Nick Ripley <97066770+nsrip-dd@users.noreply.github.com>
🐞 Maybe I'm missing something, but it seems like the entire stringsTable implementation assumes that string table ID references remain stable between profiles? I can see how that might be a reasonable assumption for the profiles currently emitted by the Go runtime, but I'm worried that it will be very difficult to adapt our implementation if this invariant gets broken in the future.
```go
err = dc.Delta(after, buf)
if err != nil {
	b.Fatal(err)
}
```
inuse_* counts. I took a closer look at how that works out for the test files we have:

| profile | before_samples | after_samples | % |
|---|---|---|---|
| heap.pprof | 497 | 5 | 1% |
| big-heap.pprof | 5676 | 686 | 12% |
I don't think we'll encounter many cases like heap.pprof in the real world where only 1% of samples are retained from the initial profile. big-heap.pprof retaining 12% is probably more realistic, but again, I'm still a little worried that we're not actually benchmarking any delta computation here. The benchmark just removes all samples that have inuse_* counts of 0.
Unfortunately, I think it will be difficult to construct a better benchmark. We'd probably have to construct b.N derivatives of our input profile with monotonically increasing counts before starting the benchmark itself. It can be done, but it will cause a lot of memory usage and significantly increase the startup time of the benchmarks.
So maybe we should just document the limitation of the benchmarks to avoid accidentally overfitting our implementation to the somewhat unrealistic workload?
I agree. We could re-work this and even incorporate b.N into the test input. I don't think that would be too difficult.
> It doesn't measure any delta computation.
Do you mean that the one actual delta in the bench results in no non-zero samples for the value types configured? (before and after are identical; yeah, we definitely don't need them both.) The table you have is only counting the non-alloc_* samples.
As there is only one Delta computation, the current benchmark is under-representing the gains for processes that last longer than 2 profiling cycles.
> As there is only one Delta computation, the current benchmark is under-representing the gains for processes that last longer than 2 profiling cycles.
Yes, that's another problem with this benchmark! I think it'd be better if it created a single DeltaComputer and then applied pprofs against it in the loop, rather than creating a new computer in each iteration.
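The suggested benchmark shape can be sketched like this: one computer created before the loop, successive profiles fed to it inside the loop. The map-based computer is a toy stand-in for fastdelta.DeltaComputer, and the whole benchmark is illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"testing"
)

// BenchmarkDeltaSteadyState creates the delta state once, outside the
// loop, so each iteration measures a steady-state delta against the
// previous "profile" rather than a fresh computer's first pass.
func BenchmarkDeltaSteadyState(b *testing.B) {
	prev := make(map[string]int64) // the single computer's retained state
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		cur := map[string]int64{"alloc_objects": int64(i)} // i-th profile
		for k, v := range cur {
			_ = v - prev[k] // delta against retained state
		}
		prev = cur
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark without `go test`.
	res := testing.Benchmark(BenchmarkDeltaSteadyState)
	fmt.Println(res.N > 0)
}
```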
Agreed. This bench could use some more work. I'll have some updates later today.
profiler/internal/fastdelta/fd.go (outdated)
```go
	return false, err
}
dc.keepLocations(dc.scratchIDs)
return true, nil
```
🐞 Since there is no includeStringIndexFields call here, samples with labels that go through this code path may have string table references that don't exist.
Good catch. In practice we throw away the first delta, and the comparingDeltaProfiler didn't catch this one, but will fix.
> Good catch. In practice we throw away the first delta, and the comparingDeltaProfiler didn't catch this one, but will fix.

Maybe hold off on that fix. I have a branch with some more refactorings where I can easily fix this as well.
The stringTable is not used between profiles and is reset at the start of each delta. Unless I'm misunderstanding something, we don't assume the IDs are stable between profiles. The strings that get written out to the delta pprof will have the same offsets as the input pprof passed to
💡 Ah, got it. For some reason I thought that there might be references into the string table from the previous profile. But I can see now that this is clearly not the case. All references are from the current profile, so it doesn't matter if the same string had a different index in the previous profile. ✅
Add a test that repeatedly generates real heap profiles, each time creating more garbage, and checking the fidelity of the delta against the original implementation. The garbage is created using mutual recursion so that there are many new unique call stacks for each iteration. This is inspired by seeing sorting in a profile where left and right recursive merging lead to an exponential number of unique call stacks. The test is too resource intensive to run on CI. Use an env var to enable it, so we can run it locally.
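The mutual-recursion trick described above can be sketched as follows: branching between two functions at each level makes a depth-n descent follow up to 2^n distinct call paths, each allocating at a unique call stack. All names here are illustrative, not the actual test's code:

```go
package main

import "fmt"

// sink keeps allocations live so the compiler can't elide them.
var sink []byte

// left and right are mutually recursive with step; the path bits decide
// which branch each level takes, so different paths yield different stacks.
func left(depth, path int) {
	step(depth, path)
}

func right(depth, path int) {
	step(depth, path)
}

func step(depth, path int) {
	if depth == 0 {
		sink = make([]byte, 128) // garbage allocated at a unique stack
		return
	}
	if path&1 == 0 {
		left(depth-1, path>>1)
	} else {
		right(depth-1, path>>1)
	}
}

func main() {
	// 2^4 = 16 distinct call paths of depth 4.
	for path := 0; path < 16; path++ {
		step(4, path)
	}
	fmt.Println(sink != nil)
}
```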
Comparison of old pprof implementation vs fastdelta:

```
name                                  old time/op    new time/op    delta
Delta/heap.pprof/setup-10             2.69ms ± 1%    0.92ms ± 0%    -65.88%  (p=0.000 n=10+8)
Delta/heap.pprof/steady-state-10      2.05ms ± 0%    0.38ms ± 0%    -81.53%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         32.1ms ± 1%    10.7ms ± 0%    -66.56%  (p=0.000 n=9+8)
Delta/big-heap.pprof/steady-state-10  26.5ms ± 2%    4.7ms ± 0%     -82.36%  (p=0.000 n=10+9)

name                                  old speed       new speed       delta
Delta/heap.pprof/setup-10             9.82MB/s ± 1%   28.78MB/s ± 0%  +193.05%  (p=0.000 n=10+8)
Delta/heap.pprof/steady-state-10      12.9MB/s ± 0%   69.6MB/s ± 0%   +441.42%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         9.68MB/s ± 1%   28.96MB/s ± 0%  +199.07%  (p=0.000 n=9+8)
Delta/big-heap.pprof/steady-state-10  11.7MB/s ± 2%   66.5MB/s ± 0%   +466.96%  (p=0.000 n=10+9)

name                                  old alloc/op   new alloc/op   delta
Delta/heap.pprof/setup-10             2.99MB ± 0%    0.24MB ± 0%    -91.94%  (p=0.000 n=10+10)
Delta/heap.pprof/steady-state-10      2.38MB ± 0%    0.00MB         -100.00%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         36.1MB ± 0%    1.7MB ± 0%     -95.26%  (p=0.000 n=10+10)
Delta/big-heap.pprof/steady-state-10  28.7MB ± 0%    0.0MB          -100.00%  (p=0.000 n=9+10)

name                                  old allocs/op  new allocs/op  delta
Delta/heap.pprof/setup-10             41.2k ± 0%     0.1k ± 0%      -99.67%  (p=0.000 n=10+10)
Delta/heap.pprof/steady-state-10      34.3k ± 0%     0.0k           -100.00%  (p=0.000 n=10+10)
Delta/big-heap.pprof/setup-10         527k ± 0%      0k ± 0%        -99.93%  (p=0.000 n=10+9)
Delta/big-heap.pprof/steady-state-10  450k ± 0%      0k             -100.00%  (p=0.000 n=9+10)

name                                  old heap-inuse-alloc/op  new heap-inuse-alloc/op  delta
Delta/heap.pprof/steady-state-10      406kB ± 0%               140kB ± 0%               -65.44%  (p=0.000 n=9+10)
Delta/big-heap.pprof/steady-state-10  4.58MB ± 0%              0.81MB ± 1%              -82.23%  (p=0.000 n=9+10)
```

Co-authored-by: Nick Ripley <nick.ripley@datadoghq.com>
The PprofDiff function was added to our public API unintentionally as part of #1511. The function is only meant for internal use. This was missed during code review, likely due to the size of the change and the fact that most of our attention was devoted to making sure the core parts of the change were implemented correctly. This is an exception to our normal v1 compatibility guarantees. It is very unlikely that any of our users depend on this function. It serves no purpose in the context of the profiler's public API, and anybody who might have a reason to use this function would probably have already found a way to do what they want using the github.com/google/pprof/profile package.
The PprofDiff function was added to our public API unintentionally as part of #1511. The function is only meant for internal use. This was missed during code review, likely due to the size of the change and the fact that most of our attention was devoted to making sure the core parts of the change were implemented correctly. Mark the function as deprecated, to be removed in a later release. This should hopefully give users a chance to notice that it's going away and stop using it before it's removed. pkg.go.dev will show that the function is deprecated, and some IDEs and linters will flag use of deprecated functions. This is an exception to our normal v1 compatibility guarantees. It is very unlikely that any of our users depend on this function. It serves no purpose in the context of the profiler's public API, and anybody who might have a reason to use this function would probably have already found a way to do what they want using the github.com/google/pprof/profile package.
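The deprecation mechanism mentioned above is a doc-comment convention: a paragraph starting with "Deprecated:" is what pkg.go.dev and linters such as staticcheck recognize. The signature and body below are an assumed stand-in, not the real PprofDiff:

```go
package main

import "fmt"

// PprofDiff is a stub illustrating the Go deprecation convention; the
// real function's signature may differ.
//
// Deprecated: PprofDiff was unintentionally exported and is only meant
// for internal use. It will be removed in a later release.
func PprofDiff(a, b []int64) []int64 {
	out := make([]int64, len(a))
	for i := range a {
		out[i] = a[i] - b[i] // element-wise difference
	}
	return out
}

func main() {
	fmt.Println(PprofDiff([]int64{5, 7}, []int64{2, 3})) // [3 4]
}
```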
Description

The pprof protobuf structure, when hydrated in-memory as Go structs, has significant overhead in memory and allocations. The structure has multiple nested self-referencing sub-structs that cause GC overhead to the point of dominating CPU resources for some services.

We need to keep this structure in memory for computing delta profiles between profile cycles, making our profiling client a cause of performance regressions in some cases.

This pull adds a new delta profile implementation - fastdelta - that uses a streaming pprof parser and low heap allocations, so the entire hydrated pprof structure no longer needs to be kept in memory between profile cycles.

Testing Done

- DD_PROFILING_DELTA_METHOD=comparing, no fidelity failures
- DD_PROFILING_DELTA_METHOD=fastdelta

Deploy Plan

- dd-trace-go release with both delta profile implementations, old algorithm as default, new algorithm enabled with the DD_PROFILING_DELTA_METHOD env var. This is so we can continue testing internally.
- Make fastdelta the default, enable old algorithm with an env var fallback if there are any issues.

Results

On synthetic benchmarks

Change Summary

- duration_nanos tracking