[PROF-9476] Add experimental profiling managed string storage #725
Conversation
Benchmarks
Benchmark execution time: 2025-01-20 11:57:04. Comparing candidate commit 022e6d8 in PR branch: found 0 performance improvements and 0 performance regressions! Performance is the same for 52 metrics, 2 unstable metrics. (Candidate and baseline benchmark details omitted due to size.)
```rust
#[must_use]
#[no_mangle]
/// TODO: @ivoanjo Should this take a `*mut ManagedStringStorage` like Profile APIs do?
pub unsafe extern "C" fn ddog_prof_ManagedStringStorage_advance_gen(
```
What does this function do?
It's not clear from the name what "advance gen" means. Does it mean increasing the reference counter/generation number so it's not collected?
I would appreciate some more exposition here too. This basically seems like a re-implementation of reference-counted strings? Do we not use those std lib types for some specific reason? Edit: I mean internally. Obviously Rust "references" and lifetimes cannot be accurately tracked across FFI. Maybe they just provide the requisite API?
This is an excellent question. Quoting Alex's original notes:
String storage cleanup in the PoC is based on usage counts. The initial implementation was not based on usage counts but on `last_usage_gen`: if we didn't use a string while building the current profile, we can drop it for the next profile. This led to crashes when we implemented the optimization to not report objects with age == 0, though (the interned strings associated with those objects would only be used after a GC, but if a profile flush occurred in between the two events, the interned strings would become invalidated).
A trivial change to that would have been to only clean up when unused for x generations. But this felt brittle (what if we have a use case where we'll skip objects with age < 10?) and memory wasteful.
So yeah the original intent was to automatically clean up unused strings based on "was this used in the last profile or not".
This (almost) matches really well with heap profiling, because heap profiling is all about repeating a sample on every profile until the object gets collected, so this was a somewhat natural fit.
But, as Alex pointed out, this becomes a bit thornier as we have an optimization on the Ruby heap profiler to not report objects that haven't at least survived a single GC cycle.
Thus, in the current version, this is purely done based on the caller doing all the tracking work, rather than the generational approach.
I'd say the code is a bit... weird right now because it's not very confident about what to do here. So here are my questions for y'all:
- Does the purely reference-counted mechanism work for you?
- Would the usage-in-generation one work for you?
I'm thinking we could even easily support both (as a setting when the managed string table gets constructed), but I don't want to get too feature-happy if nobody's interested in it yet.
(I think once we settle this, it does make sense to re-examine if we can clean up the code as much as possible, as Levi pointed out -- I'm happy to have as little custom fancy stuff as possible)
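To make the two policies concrete, here's a tiny self-contained model of the semantics being discussed (illustration only, not libdatadog code):

```c
// Tiny model of the two cleanup policies, for a single string.
#include <stdbool.h>
#include <stdio.h>

static int usage_count = 0;         // reference-counted policy state
static bool used_this_gen = false;  // generation-based policy state

static void intern(void)   { usage_count++; used_this_gen = true; }
static void unintern(void) { usage_count--; }

static void advance_gen(void) {
  // Reference-counted: drop only once nobody holds a reference.
  if (usage_count == 0) printf("refcount policy: dropped\n");
  // Generation-based: drop anything not used while building this profile.
  // This is what invalidated ids for age == 0 objects in the PoC.
  if (!used_this_gen) printf("generation policy: dropped\n");
  used_this_gen = false;
}

int main(void) {
  intern();       // object allocated, its stack's strings interned
  advance_gen();  // profile 1 flushed: both policies keep the string
  advance_gen();  // profile 2 flushed: generation policy drops it too early,
                  // even though the object is alive and its id is still needed
  unintern();     // object finally collected
  advance_gen();  // refcount policy drops it here, safely
  return 0;
}
```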
Forgot to comment sooner, but I tried out using this in the Python profiler. Commit here, very WIP because I was trying out other approaches as well at the time. The biggest issue I ran into was around forking. If the program forks while the lock is held, the child process can end up stuck on it.
Yeap, this is a good point. I think even without the lock, it's probably not a great idea to do anything other than throw away the table after the fork. (It would, on the other hand, be cool to have a lock-free implementation for this, but I fear the cost would probably not be great.) Would having a way to clear the table ignoring the lock work as a temporary solution for Python? How do y'all handle this issue for the regular profile data structures? Reset on fork as well?
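For what it's worth, a caller-side reset-on-fork could look roughly like this sketch (the `string_storage` type and `create_storage` helper are stand-ins, not libdatadog APIs):

```c
#include <pthread.h>
#include <stddef.h>

typedef struct string_storage string_storage; // stand-in for the FFI handle
extern string_storage *create_storage(void);  // hypothetical helper

static string_storage *g_storage = NULL;

static void storage_after_fork_in_child(void) {
  // The table (and its lock) inherited from the parent may be in an
  // arbitrary state, so don't touch it: abandon it and start fresh.
  // All previously interned ids must be treated as invalid from here on.
  g_storage = create_storage();
}

static void storage_init(void) {
  g_storage = create_storage();
  pthread_atfork(NULL, NULL, storage_after_fork_in_child);
}
```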
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@ Coverage Diff @@
##             main     #725      +/-   ##
==========================================
- Coverage   71.28%   70.72%   -0.57%
==========================================
  Files         319      321       +2
  Lines       46871    47300     +429
==========================================
+ Hits        33414    33453      +39
- Misses      13457    13847     +390
```
Would everyone agree with merging this?
Yeah. I followed up on this with @ivoanjo on Slack but forgot to update here. It turns out we don't need this right now for Python, so nothing about this needs to change on our account :)
```rust
}

#[no_mangle]
/// TODO: @ivoanjo Should this take a `*mut ManagedStringStorage` like Profile APIs do?
```
Ugh, this is something even in C I go back and forth on. It's one of those "do I trust the user or do I be more defensive?" things. Setting the C pointer to null makes it easier to debug when things go wrong, and can sometimes even prevent further things from going wrong. Sometimes it also makes it harder to debug because you don't get a use-after-free warning from ASAN, so that can swing both ways. But it's theoretically wholly wasted work because nobody should use the thing after it's been dropped...
Yeah, the mut option I think is nice since it means we're returning an error back on API calls made wrongly, and since the client should handle those anyway, it means we're turning something that's definitively a bug into a nice error message.
I still don't have the "answer" to this one. All I can say is that I made the others re-assign the pointer null on drop when it was feasible, so I clearly thought at the time that it was a good idea.
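For reference, the pattern being debated, sketched in C with hypothetical names (`handle_drop` stands in for `ddog_prof_ManagedStringStorage_drop`, whose exact signature may differ):

```c
#include <stddef.h>

typedef struct handle handle;       // stand-in for an opaque FFI handle
extern void handle_drop(handle *h); // stand-in for the real drop function

// Taking handle** lets drop null out the caller's pointer, so a later
// use-after-drop shows up as an easy-to-spot NULL (or a clean error from
// the API) instead of undefined behavior. The flip side, as noted above:
// ASAN can no longer flag it as a use-after-free.
static void drop_and_clear(handle **h) {
  if (h == NULL || *h == NULL) return;
  handle_drop(*h);
  *h = NULL;
}
```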
```rust
#[must_use]
#[no_mangle]
/// TODO: Consider having a variant of intern (and unintern?) that takes an array as input, instead
```
Do you have a use case for this in your PoC for Ruby? If so, I'd do it, and if not, I'd pass.
I do! We're literally interning in a loop when we need to consume a whole stack. (Note: `intern_or_raise` here is just a nice helper to call `ddog_prof_ManagedStringStorage_intern` and check if there's an error in the result.)
```c
heap_stack* heap_stack_new(heap_recorder *recorder, ddog_prof_Slice_Location locations) {
  uint16_t frames_len = locations.len;
  // ...some error checking...
  heap_stack *stack = ruby_xcalloc(1, sizeof(heap_stack) + frames_len * sizeof(heap_frame));
  stack->frames_len = frames_len;
  for (uint16_t i = 0; i < stack->frames_len; i++) {
    const ddog_prof_Location *location = &locations.ptr[i];
    stack->frames[i] = (heap_frame) {
      .name = intern_or_raise(recorder->string_storage, location->function.name),
      .filename = intern_or_raise(recorder->string_storage, location->function.filename),
      // The line in ddog_prof_Location is an int64_t. We don't expect to have to profile
      // files with more than 2M lines so this cast should be fairly safe?
      .line = (int32_t) location->line,
    };
  }
  return stack;
}
```

(From my working branch, which is based off of DataDog/dd-trace-rb#3628.)
My thinking is that, unlike most other libdatadog APIs, which either a) are very small but not called very often (e.g. setup and reporting), or b) do a big chunk of work on every call (profile add), this API c) does very little work and gets called many times.
Thus, it seems like a prime candidate to turn c) into b) by having a more coarse-grained call that lowers the overhead cost of the FFI and locking.
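To sketch what that coarse-grained variant could look like -- everything below is hypothetical; this PR only ships the one-string-at-a-time `ddog_prof_ManagedStringStorage_intern`:

```c
#include <stdint.h>

// Hypothetical types, loosely modeled on the existing FFI conventions.
typedef struct { const char *ptr; uintptr_t len; } charslice;
typedef struct { const charslice *ptr; uintptr_t len; } slice_charslice;
typedef struct { uint32_t value; } managed_string_id;
typedef struct string_storage string_storage;

// One FFI call (and one lock acquisition) interns a whole stack's worth
// of strings, writing one id per input string into a caller-provided
// output array. Returns 0 on success, non-zero on error.
extern int managed_string_storage_intern_all(
    string_storage *storage,
    slice_charslice strings,
    managed_string_id *ids_out);
```

With something like that, the interning loop in `heap_stack_new` above would collapse into a single call that fills an array of ids.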
```rust
cached_seq_num_for: Cell<Option<*const StringTable>>,
cached_seq_num: Cell<Option<StringId>>,
```
We are storing the cache on each string, but aren't these all added in a batch to the same string table? I think you said that you add all these managed strings to the Profile's string table just before serialization, right? Couldn't we perform a larger batch operation and store the cache there? That way the memory is only used on serialization rather than kept around but largely not being used.
As I was looking at this, I did a small tweak to store these as a tuple, rather than as separate entries, as they're related anyway -- 1f2b953.
Your suggestion is interesting, but I'm curious how far were you thinking about the "large batch operation". In particular, were you thinking of moving the cache entirely away from the string table, to the profile? Or even to the caller of the profile?
morrisonlevi left a comment:
Not quite finished but publishing what I have.
morrisonlevi left a comment:
Is this still considered "experimental"? What's the rough plan to either revert it or make it non-experimental? Asking mostly because it does intrude, albeit only a little, onto the FFI structs of the "main" thing.
Approved in general, I've blocked this long enough I think ^_^
Good question. I guess I'm calling it "experimental" because it's a feature we may end up iterating on a lot, including breaking APIs and whatnot, so if anyone's building on top of this, I was trying to get that across. Since the early benchmarks we had for Ruby showed good results for this, I wouldn't expect this to be entirely reverted; at most maybe we could throw it in a Ruby-specific folder if it turns out this use-case is too specific and nobody else needs it.
My thinking is that it's still harmless, even if ffi callers don't properly zero out those fields, for two reasons:
To be honest I'm somewhat more annoyed about the duplication in the profile implementation. But there, because I think the right solution is to have separate structures (thus enforcing that you either have ids OR strings), the only way of avoiding the duplication would be to introduce a bunch of macros to "hide" the fact that we'd have two versions of the code with only very minor differences. (At least given the current API...)
I'd say:
cc @danielsn any concerns with me going ahead and merging the PR?
```rust
/// TODO: Consider having a variant of intern (and unintern?) that takes an array as input, instead
/// of just a single string at a time.
/// TODO: @ivoanjo Should this take a `*mut ManagedStringStorage` like Profile APIs do?
pub unsafe extern "C" fn ddog_prof_ManagedStringStorage_intern(
```
Should these have `# Safety` comments?
I'm not sure they'd be especially helpful here.
In particular, the only comment I can think of is "the charslice needs to be valid or empty, and the managed string storage needs to be valid or null", which... IDK... seems to describe every function in our ffi that takes a pointer? 👀
This will later allow us to introduce the new StringId code without disturbing existing API clients. This duplication is intended to be only an intermediate step to introducing support for StringId.
…called with id 0. This is much nicer than having a weird panic in there.
…safer. With the current structure of the code, the `expect` inside `resolve` should never fail; hopefully we don't introduce a bug in the future that changes this. (I know that ideally in Rust we would represent such constraints in the type system, but I don't think I could do so without a lot of other changes, so I decided to go for the more self-contained solution for now.)
In particular, in the unlikely event that we would overflow the id, we signal an error back to the caller rather than impacting the application. The caller is expected to stop using this string table and create a new one instead. In the future we may make it easy to do so, by e.g. having an API to create a new table from the existing strings or something like that.
This will enable us to propagate failures when a `ManagedStringId` is not known, which will improve debuggability and code quality by allowing us to signal the error.
This string is supposed to live for as long as the managed string storage does. Treating it specially in intern matches what we do in other functions and ensures that we can never overflow the reference count (or something weird like that).
Adding it as `pub` was an oversight, since the intent is for this to be an inner helper that's only used by `intern` and by `new`. Having this as `pub` is quite dangerous as this method can easily be used to break a lot of the assumptions for the string storage.
There's currently nothing that can fail in this conversion, so let's take advantage of this in the code. (The `TryFrom` was somewhat of a leftover from copy/pasting these conversion functions from the variants that did need to deal with Strings, but in the id variants we can simplify).
This should be safer than the existing helper, since according to Levi it doesn't rely on the string being null-terminated (and I'm guessing doesn't need to measure it either).
I spotted during code review that this was incorrect -- `unintern` is a mutable operation on the managed string table (it decreases the refcount of items) so the write lock must be used for it.
Not sure if this was there from the beginning or the result of successive refactors, but yeah, we were unpacking and repacking the `ManagedStringId`s uselessly (the input and output types were the same!). I'm pretty sure the compiler would optimize all of this away -- in the end this is a struct with an int -- but our code sure does look better.
…ile_with_string_storage`
I think the commit history for this PR is worth preserving. I'm preparing to merge this to master with a regular merge commit (not with a squash). As part of it, I'll force-push a rebase so that there are no more "merge from master" commits.
(Force-pushed 8de5289 to 022e6d8.)
What does this PR do?
This PR builds on the work started by @AlexJF on #607 to introduce "managed string storage" for profiling.
The idea is to introduce another level of string storage for profiling that is decoupled in lifetime from individual profiles, and that is managed by the libdatadog client.
At its core, managed string storage provides a hashtable that stores strings and returns ids. These ids can then be provided to libdatadog instead of `CharSlice`s when recording profiling samples.

For FFI users, this PR adds the following APIs to manage strings:

- `ddog_prof_ManagedStringStorage_new`
- `ddog_prof_ManagedStringStorage_intern(String)`
- `ddog_prof_ManagedStringStorage_unintern(id)`
- `ddog_prof_ManagedStringStorage_advance_gen`
- `ddog_prof_ManagedStringStorage_drop`
- `ddog_prof_ManagedStringStorage_get_string`

A key detail of the current implementation is that each `intern` call with the same string will increase an internal usage counter, and each `unintern` call will reduce that counter. Then, at `advance_gen` time, if the counter is zero, we get rid of the string.

To interact with profiles, there's a new `ddog_prof_Profile_with_string_storage` API to create a profile with a given `ManagedStringStorage`, and all structures that make up a `Sample` (`Mapping`, `Function`, `Label`, etc.) have been extended so that they take either a `CharSlice` or a `ManagedStringId`.

Thus, after interning all strings for a sample, it's possible to add a sample to a profile entirely by referencing strings by ids, rather than `CharSlice`s.
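As a rough illustration of that flow from C (sketch only: the `storage_*` wrappers and simplified types below stand in for the real generated FFI declarations, which carry proper result/error types):

```c
#include <stdint.h>

// Simplified, assumed declarations -- the real ones come from the
// generated libdatadog headers.
typedef struct { const char *ptr; uintptr_t len; } ddog_CharSlice;
typedef struct { uint32_t value; } ddog_prof_ManagedStringId;
typedef struct ddog_prof_ManagedStringStorage ddog_prof_ManagedStringStorage;

extern ddog_prof_ManagedStringStorage *storage_new(void); // stand-in for _new
extern ddog_prof_ManagedStringId storage_intern(          // stand-in for _intern
    ddog_prof_ManagedStringStorage *s, ddog_CharSlice str);
extern void storage_unintern(                             // stand-in for _unintern
    ddog_prof_ManagedStringStorage *s, ddog_prof_ManagedStringId id);
extern void storage_advance_gen(ddog_prof_ManagedStringStorage *s); // _advance_gen

void example(void) {
  ddog_prof_ManagedStringStorage *storage = storage_new();
  // A profile created via ddog_prof_Profile_with_string_storage can
  // resolve ids from this storage when samples are added and serialized.

  ddog_CharSlice name = { "my_function", 11 };
  ddog_prof_ManagedStringId id = storage_intern(storage, name); // count: 1
  // ...reference `id` in samples, across many profiles, while needed...

  storage_unintern(storage, id); // count: 0
  storage_advance_gen(storage);  // zero-count strings get dropped here
}
```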
Motivation
The initial use-case is to support heap profiling -- "samples" related to heap profiling usually live across multiple profiles (as long as a given object is alive) and so this data must be kept somewhere.
Previously for Ruby we were keeping this on the Ruby profiler side, but having libdatadog manage this instead presents a few optimization opportunities.
We also hope to replace a few other "string tables" that other profilers had to build outside of libdatadog for similar use-cases.
This topic was also discussed in the following two documents (Datadog-only, sorry!):
Additional Notes
In keeping with the experimental nature of this feature, I've tried really hard to not disturb existing profiling API users with the new changes.
That is -- I was going for, if you're not using managed string storage, you should NOT be affected AT ALL by it -- be it API changes or overhead.
(This is why on the pure-Rust profiling crate side, I ended up duplicating a bunch of structures and functions. I couldn't think of a great way to not disturb existing API users other than introducing alternative methods, but to be honest the duplication is all in very simple methods so I don't think this substantially increases complexity/maintenance vs trying to be smarter to bend Rust to our will.)
There's probably a lot of improvements we can make, but with this PR I'm hoping to have something in a close to "good enough" state, that we can merge this in and then start iterating on master, rather than have this continue living in a branch for a lot longer.
This doesn't mean we shouldn't fix or improve things before merging, but I'll be trying to identify what needs to go in now and what can go in as separate, follow-up PRs.
As an addendum, there's still a bunch of `expect`s sprinkled around that should be turned into proper errors. I plan to do a pass on all of those. (But again, none of the panics affect existing code, so they're harmless and inert unless you're experimenting with the new APIs.)

How to test the change?
The branch in https://github.com/DataDog/dd-trace-rb/tree/ivoanjo/prof-9476-managed-string-storage-try2 is where I'm testing the changes on the Ruby profiler side.
It may not be entirely up-to-date with the latest ffi changes on the libdatadog side (I've been prettying up the API), but it shows how to use this concept, while passing all the profiling unit/integration tests, and has shown improvements in memory and latency in the reliability environment.