perf(profiling): store string data in an arena allocator#227
perf(profiling): store string data in an arena allocator#227morrisonlevi wants to merge 4 commits into
Conversation
|
I've re-ran some of the experiments I mentioned in the memory usage RFC doc:
|
| libdatadog version | RSS Before drop (task manager) | RSS After drop (task manager) | Maximum RSS |
|---|---|---|---|
| v3.0.0 | 2.2 GB | 1.1 GB | 2181076 KB |
| main@62e329fd4ddb7a34 (2023-08-23) | 1.7 GB | 640.2 MB | 1747392 KB |
| levi/string-table | 1.8 GB | 640.3 MB | 1748420 KB |
ruby_overhead_experiment.rb:
(config: {:threads=>16, :depth=>50, :seconds=>60, :timeline_enabled=>true})
| libdatadog version | Before serialization rss | After serialization rss | After GC rss ('shrinking', in the script) |
|---|---|---|---|
| main@62e329fd4ddb7a34 (2023-08-23) | 78484k | 134068k | 134068k |
| levi/string-table | 78732k | 134352k | 85504k |
I think overall the profiles.c benchmark doesn't gain a lot since it doesn't contain that many strings, so most of the overhead is elsewhere.
The Ruby benchmark does get a reduced rss after GC, which does seem to match our expectation that more memory is able to be returned to the OS (although there's still a lot hanging around -- we're still using up more rss than even before serialization; if we were releasing all libdatadog memory we would get below that number).
|
After yesterday's discussion of rss impact, I got curious on what we'd see for
|
f82e26a to
55abfc6
Compare
e489935 to
08cdeb7
Compare
55abfc6 to
b4bbaeb
Compare
0170833 to
9cfe7d6
Compare
|
I re-ran the Ruby test script with the latest version of this branch (6489a16):
|
| libdatadog version | Before serialization rss | After serialization rss | After GC rss ('shrinking', in the script) | Maximum RSS |
|---|---|---|---|---|
| main@8712cad31ce4d3 (2023-09-26) | 78740k | 77004k | 77004k | 79540k |
| levi/string-table | 78568k | 76964k | 76964k | 79528k |
...Unfortunately, I think the current test scripts are not particularly suited to measure the difference in this PR, since both examples/ffi/profiles.c as well as the Ruby example repeat the same stacks again and again, which means very few unique strings.
Maybe the replayer with a realistic complex profile with lots of unique strings may be better here?
danielsn
left a comment
There was a problem hiding this comment.
Mostly looks good, one worry about possible use-after-free
| /// don't currently have metrics on this. | ||
| /// | ||
| /// So... for now, the selected number of pages is arbitrarily chosen. | ||
| pub const GOOD_INITIAL_CAPACITY: usize = 8 * Self::PAGE_SIZE - Self::BUMP_OVERHEAD; |
There was a problem hiding this comment.
Should we make this a tunable parameter to enable easier experimentation? How much does it matter?
This uses `bumpalo::Bump` as an arena allocator to store the string data contiguously with fewer calls to the system allocator. Since the `StringTable` owns the arena and stores references to that data inside other members of the `StringTable`, it is a self- referencing data structure, which is generally unsafe. This uses `ouroboros::self_referencing` to have a safe abstraction for this pattern.
16ee890 to
eb11362
Compare
|
On 50 runs of the replayer using the PHP timeline symfony-demo pprof, the average memory delta was:
This seems within the noise tolerance of each other. |
danielsn
left a comment
There was a problem hiding this comment.
One ⛏️ about adding a comment explaining why we iterate over strings differently than everything else.
| } | ||
|
|
||
| for item in self.strings.into_iter() { | ||
| for item in self.strings.iter() { |
There was a problem hiding this comment.
This is the one case where we don't into_iter. Might be worth a comment explaining why
|
|
||
| /// Returns an iterator over the strings in the table. The items are | ||
| /// returned in the order they were inserted, matching the [StringId]s. | ||
| pub fn iter(&self) -> impl Iterator<Item = &str> { |
There was a problem hiding this comment.
Not necessary, and might be tricky with lifetimes but this got me wondering if we can have an into_iter implementation that drops the underlying storage when the iteration is done.
There was a problem hiding this comment.
It's not worth it, IMO, because we'd have to make another self-referencing struct, and it doesn't actually buy us anything (end result is the same).
|
This is superceded by #404, closing! :) |
What does this PR do?
This uses
bumpalo::Bumpas an arena allocator to store the string data contiguously with fewer calls to the system allocator.Since the
StringTableowns the arena and stores references to that data inside other members of theStringTable, it is a self-referencing data structure, which is generally unsafe. This usesouroboros::self_referencingto have a safe abstraction for this pattern.Motivation
This produces a better memory layout with fewer system allocations. It likely reduces the total memory used, and it may improve the amount of memory returned to the OS when arena chunks are dropped (see below).
Additional Notes
The initial size of the arena is
16384bytes aka 16KiB. It reserves a few bytes for itself. When it's full, it creates another chunk that seems at least as big as the previous chunk. This relieves pressure on the system allocator with string-by-string allocation, and reduces fragmentation.This
StringTableis copied and simplified from the PHP profiler. I intend for the PHP profiler to use this version of theStringTablegoing forward.How to test the change?
This is an internal change only. Nothing special needs to be done for testing.