Skip to content

perf(profiling): store string data in an arena allocator#227

Closed
morrisonlevi wants to merge 4 commits into
mainfrom
levi/string-table
Closed

perf(profiling): store string data in an arena allocator#227
morrisonlevi wants to merge 4 commits into
mainfrom
levi/string-table

Conversation

@morrisonlevi
Copy link
Copy Markdown
Contributor

@morrisonlevi morrisonlevi commented Aug 22, 2023

What does this PR do?

This uses bumpalo::Bump as an arena allocator to store the string data contiguously with fewer calls to the system allocator.

Since the StringTable owns the arena and stores references to that data inside other members of the StringTable, it is a self-referencing data structure, which is generally unsafe. This uses ouroboros::self_referencing to have a safe abstraction for this pattern.

Motivation

This produces a better memory layout with fewer system allocations. It likely reduces the total memory used, and it may improve the amount of memory returned to the OS when arena chunks are dropped (see below).

Additional Notes

The initial size of the arena is 16384 bytes aka 16KiB. It reserves a few bytes for itself. When it's full, it creates another chunk that seems at least as big as the previous chunk. This relieves pressure on the system allocator with string-by-string allocation, and reduces fragmentation.

This StringTable is copied and simplified from the PHP profiler. I intend for the PHP profiler to use this version of the StringTable going forward.

How to test the change?

This is an internal change only. Nothing special needs to be done for testing.

@github-actions github-actions Bot added the profiling Relates to the profiling* modules. label Aug 22, 2023
@morrisonlevi morrisonlevi marked this pull request as ready for review August 22, 2023 18:36
@morrisonlevi morrisonlevi requested review from a team as code owners August 22, 2023 18:36
danielsn
danielsn previously approved these changes Aug 22, 2023
Copy link
Copy Markdown
Contributor

@danielsn danielsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Comment thread profiling/src/profile/internal/string_table.rs Outdated
Comment thread profiling/src/profile/internal/string_table.rs Outdated
Comment thread profiling/src/profile/internal/string_table.rs Outdated
@ivoanjo
Copy link
Copy Markdown
Member

ivoanjo commented Aug 23, 2023

I've re-ran some of the experiments I mentioned in the memory usage RFC doc:

examples/ffi/profiles.c:

libdatadog version RSS Before drop (task manager) RSS After drop (task manager) Maximum RSS
v3.0.0 2.2 GB 1.1 GB 2181076 KB
main@62e329fd4ddb7a34 (2023-08-23) 1.7 GB 640.2 MB 1747392 KB
levi/string-table 1.8 GB 640.3 MB 1748420 KB

ruby_overhead_experiment.rb:

(config: {:threads=>16, :depth=>50, :seconds=>60, :timeline_enabled=>true})

libdatadog version Before serialization rss After serialization rss After GC rss ('shrinking', in the script)
main@62e329fd4ddb7a34 (2023-08-23) 78484k 134068k 134068k
levi/string-table 78732k 134352k 85504k

I think overall the profiles.c benchmark doesn't gain a lot since it doesn't contain that many strings, so most of the overhead is elsewhere.

The Ruby benchmark does get a reduced rss after GC, which does seem to match our expectation that more memory is able to be returned to the OS (although there's still a lot hanging around -- we're still using up more rss than even before serialization; if we were releasing all libdatadog memory we would get below that number).

@ivoanjo
Copy link
Copy Markdown
Member

ivoanjo commented Aug 25, 2023

After yesterday's discussion of rss impact, I got curious on what we'd see for examples/ffi/profiles.c with tcmalloc and jemalloc, so here's the results:

libdatadog version RSS Before drop (task manager) RSS After drop (task manager) Maximum RSS
levi/string-table glibc 1.8 GB 640.3 MB 1748420 KB
levi/string-table tcmalloc 1.7 GB 1.7 GB 1656732 KB
levi/string-table jemalloc 1.3 GB 165.9 MB 1299504 KB

@morrisonlevi morrisonlevi changed the base branch from main to levi/collections August 29, 2023 02:55
Base automatically changed from levi/collections to main August 31, 2023 17:09
@morrisonlevi morrisonlevi marked this pull request as draft August 31, 2023 17:10
@morrisonlevi morrisonlevi force-pushed the levi/string-table branch 2 times, most recently from 0170833 to 9cfe7d6 Compare September 26, 2023 03:27
@ivoanjo
Copy link
Copy Markdown
Member

ivoanjo commented Sep 27, 2023

I re-ran the Ruby test script with the latest version of this branch (6489a16):

ruby_overhead_experiment.rb:

(config: {:threads=>16, :depth=>50, :seconds=>60, :timeline_enabled=>true})

libdatadog version Before serialization rss After serialization rss After GC rss ('shrinking', in the script) Maximum RSS
main@8712cad31ce4d3 (2023-09-26) 78740k 77004k 77004k 79540k
levi/string-table 78568k 76964k 76964k 79528k

...Unfortunately, I think the current test scripts are not particularly suited to measure the difference in this PR, since both examples/ffi/profiles.c as well as the Ruby example repeat the same stacks again and again, which means very few unique strings.

Maybe the replayer with a realistic complex profile with lots of unique strings may be better here?

@morrisonlevi morrisonlevi marked this pull request as ready for review September 27, 2023 14:10
Copy link
Copy Markdown
Contributor

@danielsn danielsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, one worry about possible use-after-free

Comment thread profiling/src/collections/identifiable/string_table.rs
Comment thread profiling/src/collections/identifiable/string_table.rs
Comment thread profiling/src/collections/identifiable/string_table.rs
/// don't currently have metrics on this.
///
/// So... for now, the selected number of pages is arbitrarily chosen.
pub const GOOD_INITIAL_CAPACITY: usize = 8 * Self::PAGE_SIZE - Self::BUMP_OVERHEAD;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a tunable parameter to enable easier experimentation? How much does it matter?

Comment thread profiling/src/internal/profile.rs
Comment thread profiling/src/pprof/sliced_proto.rs Outdated
This uses `bumpalo::Bump` as an arena allocator to store the string
data contiguously with fewer calls to the system allocator.

Since the `StringTable` owns the arena and stores references to that
data inside other members of the `StringTable`, it is a self-
referencing data structure, which is generally unsafe. This uses
`ouroboros::self_referencing` to have a safe abstraction for this
pattern.
@morrisonlevi
Copy link
Copy Markdown
Contributor Author

On 50 runs of the replayer using the PHP timeline symfony-demo pprof, the average memory delta was:

  • main: 800.72
  • levi/string-table: 792.14

This seems within the noise tolerance of each other.

Copy link
Copy Markdown
Contributor

@danielsn danielsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One ⛏️ about adding a comment explaining why we iterate over strings differently than everything else.

Comment thread profiling/src/collections/identifiable/string_table.rs
}

for item in self.strings.into_iter() {
for item in self.strings.iter() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one case where we don't into_iter. Might be worth a comment explaining why


/// Returns an iterator over the strings in the table. The items are
/// returned in the order they were inserted, matching the [StringId]s.
pub fn iter(&self) -> impl Iterator<Item = &str> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary, and might be tricky with lifetimes but this got me wondering if we can have an into_iter implementation that drops the underlying storage when the iteration is done.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not worth it, IMO, because we'd have to make another self-referencing struct, and it doesn't actually buy us anything (end result is the same).

@ivoanjo
Copy link
Copy Markdown
Member

ivoanjo commented May 1, 2024

This is superceded by #404, closing! :)

@ivoanjo ivoanjo closed this May 1, 2024
@morrisonlevi morrisonlevi deleted the levi/string-table branch October 17, 2024 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

profiling Relates to the profiling* modules.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants