-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-4780] Add HttpTransport class to report profiles through libddprof #1923
[PROF-4780] Add HttpTransport class to report profiles through libddprof #1923
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, as far as my limitations go of knowing Ruby C APIs go. I didn't check that all the Check_Type
calls are right, or that UINT2NUM
values are in range or anything like that.
Is there anything in particular you want me to look at?
|
||
static VALUE create_exporter(struct ddprof_ffi_EndpointV3 endpoint, VALUE tags_as_array) { | ||
long tags_count = rb_array_len(tags_as_array); | ||
ddprof_ffi_Tag converted_tags[tags_count]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Ruby have macros/functions for optionally using alloca
and falling back to malloc
depending on the size and if the platform supports it?
PHP has these macros, and it's useful for cases like this where it should be small enough to put on the stack but in case someone passes a large array you are protected against blowing your stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's... a great point, thanks! I dug around the documentation and the VM headers and there are a few things there (undocumented, of course!) but nothing that seemed a really good fit.
So in 65cbe6d I just changed this to a heap allocation. Shrug emoji :)
ddprof_ffi_SendResult result; | ||
}; | ||
|
||
inline static ddprof_ffi_ByteSlice byte_slice_from_ruby_string(VALUE string); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, you can probably drop inline
here; in this case it wouldn't do anything different than just static
. The only exception I'm aware is when you mix inline static
and plain inline
(no static) functions, then sometimes the compiler will yell at you. It's not important, just a bit of a C semantic you may or may not know about :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL! I expect at some point this will move to a shared file, and I'll drop the static
part, so I'm thinking of leaving it for now more as a declaration of intention than expecting different semantics.
Nice link discussing this as well -> https://stackoverflow.com/questions/7762731/whats-the-difference-between-static-and-static-inline-function .
Check_Type(tag_name, T_STRING); | ||
Check_Type(tag_value, T_STRING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if Check_Type
fails? I see it "raises" an exception but how is that technically implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's what you might've suspected: our friends setjmp and longjmp :)
Control will jump to a Ruby VM-controlled exception handler.
Thanks for the great inputs so far! The Ruby VM functions usually are safe -- they internally check invariants and throw exceptions if things aren't as intended.
Yes -- if I'm using libddprof in a sane way, especially all the resource cleanup related comments. Am I releasing the things that should be released, and correctly ignoring the things that libddprof takes ownership of -- including in the failure cases? |
8ae9ea1
to
da925a0
Compare
|
||
inline static ddprof_ffi_ByteSlice byte_slice_from_ruby_string(VALUE string) { | ||
Check_Type(string, T_STRING); | ||
ddprof_ffi_ByteSlice byte_slice = {.ptr = (uint8_t *) StringValuePtr(string), .len = RSTRING_LEN(string)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the memory pointed by the pointer returned by StringValuePtr
managed? Does it point to a GCed memory area, or is it a raw pointer? If a raw pointer, how do we make sure it doesn't leak (especially regarding the use of rb_raise
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a raw pointer to the actual char *
that Ruby strings keep. So it has the same lifetime as the Ruby object, and we don't need to clean it up, since the Ruby GC will manage its lifetime (when the object gets collected, the string will get free'd).
On the other hand, we do have to be careful around concurrency, e.g. we must never manipulate Ruby strings if we have released the Global VM Lock, because Ruby may decide to move/change/collect them concurrently.
struct call_exporter_without_gvl_arguments args = {.exporter = exporter, .request = request}; | ||
// TODO: We don't provide a function to interrupt reporting, which means this thread will be blocked until | ||
// call_exporter_without_gvl returns. | ||
rb_thread_call_without_gvl(call_exporter_without_gvl, &args, NULL, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens to the memory pointed by StringValuePtr when the GVL isn't held? Can a GC run and move things around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can clean it up (if we don't keep a GC-visible reference to it), move it, etc.
This is why we can't release the GVL until after we build the libddprof request. Only afterwards, when libddprof has copied stuff to its own internal representation can we release the GVL.
VALUE code_provenance_data, | ||
VALUE tags_as_array | ||
) { | ||
ddprof_ffi_NewProfileExporterV3Result exporter_result = create_exporter(exporter_configuration, tags_as_array); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to build a new exporter because we can't yet set tags after the creation of the exporter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was one of the reasons I did it (see 5d12ae6 ) although that is no longer the case. Quoting from that commit description:
Reusing the libddprof exporter has a number of disadvantages:
We need to have quite a bit of boilerplate to be able to represent
them as Ruby objectsTags are set at exporter creation, so we would need to somehow
tackle the issue of tags such as runtime-id and pid that can change
during the lifetime of a process
(⬆️ This has been fixed by Levi in recent versions of libddprof)
- There were some concerns around libddprof and Ruby apps that fork,
although we didn't look into them in much detailAfter benchmarking both approaches using the
benchmarks/profiler_http_transport.rb
benchmark, the performance
of anHttpTransporter
that does not reuse the exporter is more
than acceptable. I measured > 1000 reports per second on my laptop;
and the expected usage for this class is one report per minute, so
I think we're good.Thus, let's refactor
HttpTransport
to create a new exporter
every time we report:
This allows the
tags
to change from report to report (and thus
they move back to be part of theFlush
, as they were in the
pre-HttpTransport
times)We eliminate the need for the
HttpTransport::Exporter
class that
wrapped the native libddprof pointer.No need to care about potential issues with forks.
I'd say the main advantage is the fork-safety -- Ruby apps can and do fork, at any time, so we could always leave an exporter in a weird internal state, so it seems simpler and less error-prone to not reuse it.
da925a0
to
be6fb26
Compare
c85d4e3
to
59637a9
Compare
I had previously created #1936 which included a bunch of extra fixes that hadn't made it to this PR. While revisiting these PRs, I don't think having #1936 separate from this PR makes sense anymore (it accumulated too many useful fixes), so I've added the commits from it to this PR, and closed that PR. |
00280c5
to
e10a0e9
Compare
HttpTransport
class to report profiles through libddproflibbdprof
gem as dependency + Add HttpTransport
class to report profiles through libddprof
libbdprof
gem as dependency + Add HttpTransport
class to report profiles through libddprof
I've rebased this PR on top of current master, as well as added a few fixes to make sure it's in a shippable state :) |
df58eb4
to
81ff330
Compare
(The extraction originally came from `http_transport.c`, added in #1923 but after a rebase only the addition of the new header remained)
…dprof One of the features provided by `libddprof` is reporting profiles. The `HttpTransport` class exposes this functionality to Ruby code, and will in the future replace most of the code in `lib/datadog/profiling/transport/. This commit and PR only adds the `HttpTransport` without wiring it up; that will come in a separate PR.
The number of tags is variable, and we expect not that big, but the failure sematics of stack allocation are completely bonkers so let's just switch this to a regular heap allocation.
…ise error In #1932 we did the groundwork required to support parsing the unix domain socket configuration from a c.tracing.transport_options configuration. Thus, we can now support unix domain sockets configured by the user without any other change (assuming that PR is merged before this commit is). The `transport_configuration_proc` can still contain weird transport configurations, but if that's the case, we just ignore them and print a warning. This is because we don't support every setting a user may provide in that proc. Worst case, for customers with a really custom config, is that the profiler won't be able to report data, and the customer will need to reach out to support for help updating it. (But we won't raise an exception or stop the profiler from starting).
Code provenance collection can be disabled, and in that case should not be reported. This was already a use case supported by the previous transport code, but it was missed when `HttpTransport` got created.
See DataDog/libddprof#33 for details.
This benchmark can be used to evaluate the performance and memory correctness of the `HttpTransport` class. In particular, it was used to fix a memory leak (and to validate that we don't have others), and was used to validate the change where `HttpTransport` creates a new exporter for every report, rather than reusing exporters.
Reusing the libddprof exporter has a number of disadvantages: * We need to have quite a bit of boilerplate to be able to represent them as Ruby objects * Tags are set at exporter creation, so we would need to somehow tackle the issue of tags such as runtime-id and pid that can change during the lifetime of a process * There were some concerns around libddprof and Ruby apps that fork, although we didn't look into them in much detail After benchmarking both approaches using the `benchmarks/profiler_http_transport.rb` benchmark, the performance of an `HttpTransporter` that does not reuse the exporter is more than acceptable. I measured > 1000 reports per second on my laptop; and the expected usage for this class is one report per minute, so I think we're good. Thus, let's refactor `HttpTransport` to create a new exporter every time we report: * This allows the `tags` to change from report to report (and thus they move back to be part of the `Flush`, as they were in the pre-`HttpTransport` times) * We eliminate the need for the `HttpTransport::Exporter` class that wrapped the native libddprof pointer. * No need to care about potential issues with forks.
Because we are using `rb_thread_call_without_gvl2`, we need to be careful not to assume that our code ran at all after it returns.
Previously we were sending all webrick logs to a `StringIO` that was never checked, which meant that if something was wrong inside webrick, we would not know about it (since webrick runs on a background thread). Instead, I've configured webrick to log to stderr, and raised the log level so only important issues get printed.
The really long timeout may make a failing test hang, whereas with a shorter timeout libddprof will just fail the request and the test will finish with an error.
There is a bug in webrick in Ruby 2.2 when used with `DoNotListen: true` which broke in our tests. I've added a workaround for it.
Whenever something is dynamically allocated, we must ensure that either no exceptions can be thrown or that we clean it up before throwing an exception.
This is becoming a common-enough operation that it makes sense to have a nice wrapper for it.
…ddprof objects The previous version of the code went: 1. Create libddprof exporter 2. Convert Ruby VALUE objects into libddprof equivalents 3. Fire off the request In the happy path, this worked fine. But if in step 2 any exception was raised (because some object was `nil` or of an unexpected type), then the exporter would be leaked. To fix this, I've flipped the code around so that the order now goes: 1. Convert Ruby VALUE objects into libddprof equivalents 2. Create libddprof exporter 3. Fire off the request Thus, we now do all of the work that may raise exceptions up front, and only then do we start calling libddprof APIs which allocate dynamic memory. (I did slightly move the code around to minimize the diff, and that's why `build_request` actually became `_native_do_export`, and the old code in `_native_do_export` became the new `perform_export`).
Make sure we clean up `tags` if logging failed with an exception.
It took me a bit to wrap my head around this one, but `rb_thread_call_without_gvl2` can return with an "interrupt". This interrupt may mean that the VM has some work to do (and wants us to call something like `rb_thread_check_ints()`), BUT that SHOULD NOT be a reason for the export to fail, as this is expected to happen from time to time. I initially missed this subtlety, and it manifested itself as flaky tests in CI (this seems to happen more often in Ruby 2.3 and below). Tests would randomly fail with the `Interrupted before call_exporter_without_gvl ran` message we previously had. After some experimentation, I found I could reproduce on any Ruby version, if I just did enough calls to export: ```ruby context 'when server returns a 5xx failure' do let(:server_proc) { proc { |_req, res| res.status = 503 } } it 'logs an error' do expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/) 10_000.times do |i| print '@' if i % 100 == 0 http_transport.export(flush) end end end ``` The fix here is to separate two different cases: * There's an interrupt and the VM wants to process it. This shouldn't fail the export -- we just let the VM do whatever and retry our call. * There's an exception to be raised. This will fail the export.
81ff330
to
4cd898e
Compare
I had held back on marking this PR as non-draft as I was seeing a few flaky tests related to releasing the GVL. I was able to track them down and fix them in 4cd898e, so this is ready to go! (I also rebased it on top of current master) |
// We're in a weird situation that libddprof doesn't quite support. The ddprof_ffi_Request payload is dynamically | ||
// allocated and needs to be freed, but libddprof doesn't have an API for dropping a request because it's kinda | ||
// weird that a profiler builds a request and then says "oh, nevermind" and throws it away. | ||
// We could add such an API, but I'm somewhat hesitant since this is a really bizarre corner case that looks quite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll included it in the next version. I was just noticing independently we don't have such a thing. How does ddprof_ffi_Request_drop
sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, that's nice! Yeah ddprof_ffi_Request_drop
works great for me, thanks!
Thanks y'all! Merging ahead! CI is red due to a lint issue that has been fixed in master, but I don't think it's worth chasing it down by merging master into this branch or rebasing just because rubocop wants an extra blank line somewhere. |
…of failing After fixing the VM interruptions issue for `HttpTransport` in 4cd898e (#1923), It looks to me that the same issue could happen to the `StackRecorder`. I tried to reproduce the issue in a similar way as we did for the `HttpTransport` but just running `serialize` in a loop did not trigger it. On a hunch, I decided to hack it into the `HttpTransport` specs (since `HttpTransport` seemed to be able to cause the VM to have pending interrupts) and I was able to see it fail, very rarely on Ruby 2.7: ```diff @@ -419,9 +419,13 @@ RSpec.describe Datadog::Profiling::HttpTransport do let(:server_proc) { proc { |_req, res| res.status = 503 } } it 'logs an error' do - expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/) + #expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/) - http_transport.export(flush) + 10_000.times do |i| + print '@' if i % 100 == 0 + Datadog::Profiling::StackRecorder.new.serialize + http_transport.export(flush) + end end end ``` TL;DR `rb_thread_call_without_gvl2` can return BEFORE running our code when the VM has a pending interruption, so we need to let the VM do its work, but there's no reason to fail the action (in this case, fail to serialize), if all the VM wants is to run some callback. So instead we retry running our code until it either runs or the VM raises an exception. The approach here is simpler than in `HttpTransport` because the heap allocations happen only inside `call_serialize_without_gvl`. In `HttpTransport` the code is more awkward because there are heap allocations done before calling `rb_thread_call_without_gvl2` and those need to be cleaned up as well.
…of failing After fixing the VM interruptions issue for `HttpTransport` in 4cd898e (#1923), I suspected that the same issue could happen to the `StackRecorder`. It took a few attempts to reproduce this issue, but the easiest way seems to be to "send ctrl+c to itself" by adding ```c #include <signal.h> kill(getpid(), SIGINT); ``` before the call to `rb_thread_call_without_gvl2`. This made the VM flag an interrupt, which as long as no exception was raised (and RSpec has its own interrupt handler) simulated the issue. TL;DR `rb_thread_call_without_gvl2` can return BEFORE running our code when the VM has a pending interruption, so we need to let the VM do its work, but there's no reason to fail the action (in this case, fail to serialize) if all the VM wants is to run some callback and then we can get on with our work. So instead we retry running our code until it either runs or the VM actually raises an exception. The approach here is simpler than in `HttpTransport` because the heap allocations happen only inside `call_serialize_without_gvl`; in `HttpTransport` there are heap allocations done before calling `rb_thread_call_without_gvl2` and thus a more complex approach is needed.
…tpTransport` class Everything in `lib/datadog/profiling/transport/` already existed prior to the migration to the libddprof-based `HttpTransport` class (#1923). Temporarily, and just in case there are unknown-unknowns with `HttpTransport`, I've re-imported a few classes, and modified them so they will be able to be used **instead of** `Datadog::Profiling::HttpTransport`. This is only temporary, and will be deleted/reverted later.
…tpTransport` class Everything in `lib/datadog/profiling/transport/` already existed prior to the migration to the libddprof-based `HttpTransport` class (#1923). Temporarily, and just in case there are unknown-unknowns with `HttpTransport`, I've re-imported a few classes, and modified them so they will be able to be used **instead of** `Datadog::Profiling::HttpTransport`. This is only temporary, and will be deleted/reverted later.
One of the features provided by
libddprof
is an http/https client for reporting profiles.The
HttpTransport
class exposes this functionality to Ruby code, and will in the future replace most of the code inlib/datadog/profiling/transport/
.This PR only adds the
HttpTransport
class without wiring it up; that will come in a separate PR.Because the
HttpTransport
class is not wired up, this is not expected to have any observable change to customers (and thus no impact).