
feat(profiling)!: add arena-based string table #306

Closed
wants to merge 53 commits into main from levi/arena-string-table

Conversation

morrisonlevi
Contributor

@morrisonlevi morrisonlevi commented Feb 7, 2024

What does this PR do?

Creates a string table that stores the bytes of the strings into an arena, instead of using individual allocations. The arena has a maximum capacity, and uses virtual memory to acquire pages lazily.

This is a breaking change:

  1. Many Rust routines that previously weren't fallible now are, to account for the string table hitting its capacity or initial allocations failing.
  2. The minimum string arena capacity in bytes must be passed to ddog_prof_Profile_new.

Motivation

I'm working on reducing the overhead of the PHP profiler, and reducing the memory usage of the string table is a nice step. It also puts less pressure on the system allocator, which should help others like the Ruby profiler as well.

Additional Notes

The PR is quite large. Maybe I should split it into two pieces, but it's kind of hard to know where to split it.

PROF-9125

How to test the change?

The function ddog_prof_Profile_new has a new parameter for the minimum size of the arena, which will round up to the nearest page size for you. The memory is virtually allocated, so it will be paged in by the OS as needed. You don't need to worry too much about its size, as long as you are being reasonable (e.g. 2 GiB may be fine, 2 TiB is not).
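To make the rounding behavior concrete, here is a minimal sketch (a hypothetical helper, not the library's actual code; `page_size` stands in for whatever the OS reports):

```rust
// Hypothetical illustration only: round a capacity hint up to the next
// multiple of the OS page size. Overflow handling is omitted in this sketch.
fn round_up_to_page(capacity_hint: usize, page_size: usize) -> usize {
    (capacity_hint + page_size - 1) / page_size * page_size
}
```

For example, with 4 KiB pages, `round_up_to_page(5_000, 4096)` returns 8192.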

For Reviewers

  • If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@github-actions github-actions bot added the profiling Relates to the profiling* modules. label Feb 7, 2024
@morrisonlevi morrisonlevi force-pushed the levi/arena-string-table branch 8 times, most recently from 9a650e3 to eb600be Compare February 8, 2024 15:21
@morrisonlevi
Contributor Author

morrisonlevi commented Feb 9, 2024

In PHP, I got a 3-4% memory savings, ~2.22MiB, on the containers I tested. I'm happy with this number for PHP and look forward to hearing numbers from any other runtime which can test it.

Base automatically changed from levi/box-str to main February 13, 2024 15:20
/// and must have the correct number of elements for the slice.
#[no_mangle]
#[must_use]
pub unsafe extern "C" fn ddog_prof_Profile_new(
sample_types: Slice<ValueType>,
period: Option<&Period>,
start_time: Option<&Timespec>,
string_arena_min_capacity: libc::size_t,
Contributor

Do we really need to expose this to the client libraries? I feel like it should be internal.
To be honest, when I create a profile object I don't know what the minimum should be.
I feel like it's worth hiding (since it's a technical detail).

Contributor Author

It's exposed because every runtime has different ideas about how much memory usage is okay. You making this comment made me realize that "min_capacity" is not the best name: it rounds up to a page size, and if you exceed that it will fail. So this number should be close to the maximum you want. It's lazily allocated, so you don't need to worry too much about the memory unless you actually use it, as long as your request is sane (you know, like 1 GiB, not 1 TiB).

Contributor

@gleocadie gleocadie Mar 31, 2024

> It's exposed because every runtime has different ideas about how much memory usage is okay.

Hmm, I see, but for .NET I don't know what a good value would be.
It will push libraries to use a big value (since it's lazy) and forget about it. Going that way, keeping it internal still makes sense.
If you still want to let the libraries set this value, could you make it optional?

ProfileNewResult::Ok(ffi_profile)
match internal::Profile::new(start_time, &types, period, string_arena_min_capacity) {
Ok(internal_profile) => ProfileNewResult::Ok(Profile::new(internal_profile)),
Err(err) => ProfileNewResult::Err(Error::from(err.context("ddog_prof_Profile_new failed"))),
Contributor

Maybe add a test that actually fails?

@@ -468,8 +469,13 @@ pub unsafe extern "C" fn ddog_prof_Profile_set_endpoint(
}
};
let endpoint = endpoint.to_utf8_lossy();
profile.add_endpoint(local_root_span_id, endpoint);
ProfileResult::Ok(true)
if let Err(err) = profile.add_endpoint(local_root_span_id, endpoint) {
Contributor

same comment about testing

@morrisonlevi morrisonlevi marked this pull request as ready for review April 1, 2024 15:21
@morrisonlevi morrisonlevi requested review from a team as code owners April 1, 2024 15:21
This helps keep the APIs for various platforms in sync. It also helps
with testing, to ensure the code that handles virtual_alloc failures
actually works.
This is a more accurate name that suggests the exact value will not
necessarily be used.
@@ -884,6 +893,7 @@ mod test {
Slice::from_raw_parts(sample_type, 1),
None,
None,
4096,
Contributor

Do we have any tests that vary this value?

Contributor Author

Not specifically in FFI, no. There are tests on the arena and the string table for small values like 64, and there is a test on the string table which uses 4 * u16::MAX, in other words a number just under 256 KiB.

But I'm not currently testing other values in FFI.

@@ -151,7 +151,8 @@ fn main() -> anyhow::Result<()> {
replayer.start_time,
&replayer.sample_types,
replayer.period,
);
u32::MAX as usize,
Contributor

why this value?

Contributor Author

I had to provide some value. A max of 4 GiB seemed okay for the request replayer.

profiling/src/alloc/arena.rs (outdated)
let layout = layout.pad_to_align();
let size = layout.size();
if size == 0 {
return Ok(NonNull::from(&mut []));
Contributor

Why not NonNull::dangling()?

Contributor Author

Because it's a wide pointer, NonNull<[u8]>, so we also need the size info.

error[E0277]: the size for values of type `[u8]` cannot be known at compilation time
  --> profiling/src/alloc/arena.rs:70:23
   |
70 |             return Ok(NonNull::dangling());
   |                       ^^^^^^^^^^^^^^^^^ doesn't have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `[u8]`
note: required by a bound in `NonNull::<T>::dangling`
  --> /Users/levi.morrison/.rustup/toolchains/1.69.0-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ptr/non_null.rs:67:6
   |
67 | impl<T: Sized> NonNull<T> {
   |      ^ required by this bound in `NonNull::<T>::dangling`
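As a hedged illustration of that reply (not the PR's code): `[u8]` is unsized, so `NonNull::<[u8]>::dangling()` doesn't exist, but a zero-length slice still yields a valid wide pointer:

```rust
use std::ptr::NonNull;

// Sketch only: a zero-sized request can hand back a wide pointer (address +
// length) to an empty slice. Empty arrays behind `&mut` are the one case
// where rvalue promotion applies, and a zero-length slice is valid at any
// non-null, aligned address anyway.
fn empty_wide_ptr() -> NonNull<[u8]> {
    NonNull::from(&mut [])
}

#[test]
fn zero_size_request_yields_empty_slice() {
    let ptr = empty_wide_ptr();
    // SAFETY: the pointer refers to a valid empty slice.
    assert_eq!(unsafe { ptr.as_ref() }.len(), 0);
}
```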

profiling/src/alloc/arena.rs (outdated)
// but in practice, it won't fail because it's rounded to a page size,
// and I've never seen pages that small, even for 16 bit. However, in
// any case, it should not panic, which is the point of the call.
_ = std::hint::black_box(arena.allocate_zeroed(Layout::new::<u8>()));
Contributor

TIL that we don't need a let before _


#[test]
fn test_arena_simple_alignment() -> anyhow::Result<()> {
const DISTANCE: usize = 16;
Contributor

Is there a reason that DISTANCE has the opposite meaning in this vs the previous test?

let third = arena.allocate_zeroed(pointer)?;

expect_distance(first, second, mem::size_of::<*const ()>());
expect_distance(second, third, mem::size_of::<*const ()>());
Contributor

Also directly test that the pointers are aligned
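A possible shape for that check, as a rough sketch (assuming the allocations come back as `NonNull<[u8]>`, as elsewhere in this PR; `expect_aligned` is a hypothetical helper):

```rust
use std::ptr::NonNull;

// Sketch of a direct alignment assertion to complement the distance checks.
fn expect_aligned(ptr: NonNull<[u8]>, align: usize) {
    let addr = ptr.cast::<u8>().as_ptr() as usize;
    assert_eq!(addr % align, 0, "pointer {addr:#x} is not aligned to {align}");
}
```

It could be called alongside the existing `expect_distance` calls, e.g. `expect_aligned(first, mem::align_of::<*const ()>())`.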

profiling/src/alloc/arena.rs (outdated)
pub r#type: StringId,
pub unit: StringId,
pub(crate) struct ValueType {
pub r#type: (Option<LengthPrefixedStr>, StringId),
Contributor

Why do we need both of these?

use std::sync::Once;

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub struct AllocError;
Contributor

Do we want opaque or discriminated errors on this API?

Comment on lines +85 to +94
static INIT: Once = Once::new();
static mut PAGE_SIZE: usize = 0;

unsafe {
INIT.call_once(|| {
PAGE_SIZE = os::page_size().unwrap();
});
PAGE_SIZE
}
}
Contributor

Could this be a lazy_static?
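One possible shape of that suggestion, as a hedged sketch (reusing the `os::page_size()` helper called above; `std::sync::OnceLock` needs Rust 1.70+, otherwise `lazy_static` or `once_cell` gives the same pattern):

```rust
use std::sync::OnceLock;

// Sketch only: cache the OS page size without `static mut` or `unsafe`.
fn page_size() -> usize {
    static PAGE_SIZE: OnceLock<usize> = OnceLock::new();
    *PAGE_SIZE.get_or_init(|| os::page_size().unwrap())
}
```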

Comment on lines 9 to 11
/// The fatptr must have been previously allocated by [virtual_alloc] by
/// this allocator, and must have the same address and length as it was
/// returned with.
Contributor

And must not have been previously freed

/// The language infers !Send because the value types have length-prefixed
/// strings inside of them, and these are not Send by themselves. However, if
/// the whole profile is moved, including the string table, then this is safe.
unsafe impl Send for Profile {}
Contributor

Do we have code that actually takes advantage of this?

unit: profile.intern(vt.unit),
.map(|vt| -> anyhow::Result<ValueType> {
Ok(ValueType {
r#type: profile.strings.insert_full(vt.r#type)?,
Contributor

What is going on here?

let arena = self.strings.arena();
api::Period {
r#type: api::ValueType {
r#type: arena.fetch(t.1.r#type.0),
Contributor

Please add a comment explaining what's going on here.

Comment on lines +324 to +325
let mut iter = self.strings.into_iter();
while let Some(item) = iter.next() {
Contributor

Why can't we use a for loop? Does into_iter not actually give an iterator here?

#[inline]
fn intern<S>(&mut self, item: &S) -> anyhow::Result<StringId>
where
S: ?Sized + Borrow<str>,
Contributor

Do we need this?


let period = (
10_000_000,
ValueType {
r#type: profile.intern("wall-time"),
unit: profile.intern("nanoseconds"),
r#type: profile.strings.insert_full("wall-time")?,
Contributor

This is because of the weird type we have for ValueType.

@danielsn
Contributor

danielsn commented Apr 1, 2024

What happens if HUGE_PAGES are enabled?

@@ -133,7 +143,8 @@ impl Profile {
start_time: SystemTime,
sample_types: &[api::ValueType],
period: Option<api::Period>,
) -> Self {
string_arena_capacity_hint: usize,
Member

Here and in the FFI, it may be worth specifying that the unit here is bytes.

Contributor

Is this a hint, or an actual max?
Do we need to pass a big value and assume that RSS usage only grows as the pages are actually used?
There seem to be some concerns around providing accurate hints here.

Contributor Author

@morrisonlevi morrisonlevi Apr 8, 2024

It gets rounded up to a page size. There's mostly no consequence for setting this high, because it's virtual memory, so you pay for what you actually use. There are some edge cases, though: on Linux, the OOM killer is more attracted to your process if you have high virtual memory use, and .NET supports 32-bit processes, which only have 2 GiB of virtual memory available per process.

I met with Christophe and Gregory and they are concerned about hitting a max, any max, because that will screw up the accuracy of the upscaling, and they would consider this inaccuracy intolerable in .NET. I knew that the data would be incomplete, and that may have been okay. But it will also be inaccurate, and that's not.

Making a more complicated and advanced allocator is a possibility, but then the risk surface goes up too. I'm currently exploring an arena which has fixed-size chunks (maybe 1-2 MiB in size) and allocates new chunks as needed. It uses the system allocator to hold these in a Vec, which isn't ideal, but that's far less memory than is represented by the chunks themselves, so it would still greatly reduce the effects of commingling with the system allocator. I could avoid this by using an intrusive linked list, but that slows performance and makes it harder to provide a global offset. Anyway, this is the direction I'm looking at right now, while also searching for existing crates that might do what we want whenever I think of a new search term.
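To make the chunked-arena idea concrete, here is a very rough sketch (not the PR's implementation; heap-backed `Vec<u8>` chunks stand in for virtually allocated ones): fixed-size chunks held in a `Vec`, bump allocation within the current chunk, and a new chunk whenever the current one runs out.

```rust
/// Sketch of a chunked bump arena; just the shape of the idea.
struct ChunkedArena {
    chunks: Vec<Vec<u8>>,
    chunk_size: usize,
    offset: usize, // bump offset into the last chunk
}

impl ChunkedArena {
    fn new(chunk_size: usize) -> Self {
        ChunkedArena {
            chunks: vec![vec![0u8; chunk_size]],
            chunk_size,
            offset: 0,
        }
    }

    /// Reserve `len` zeroed bytes; requests larger than a whole chunk are refused.
    fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        if len > self.chunk_size {
            return None;
        }
        if self.offset + len > self.chunk_size {
            // Current chunk is full: start a new one.
            self.chunks.push(vec![0u8; self.chunk_size]);
            self.offset = 0;
        }
        let start = self.offset;
        self.offset += len;
        let chunk = self.chunks.last_mut().expect("at least one chunk");
        Some(&mut chunk[start..start + len])
    }
}
```

The `Vec` of chunk handles is the only place the system allocator is involved here, and the chunk buffers never move once created, so interned bytes stay at stable addresses.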

@morrisonlevi
Contributor Author

Closing in favor of #404.

Labels

ci-build, profiling (Relates to the profiling* modules.)