[PROF-4780] Add HttpTransport class to report profiles through libddprof #1923

ivoanjo · 2022-03-01T12:11:59Z

One of the features provided by libddprof is an http/https client for reporting profiles.

The HttpTransport class exposes this functionality to Ruby code, and will in the future replace most of the code in lib/datadog/profiling/transport/.

This PR only adds the HttpTransport class without wiring it up; that will come in a separate PR.

Because the HttpTransport class is not wired up, this is not expected to have any observable change to customers (and thus no impact).

morrisonlevi

Looks good to me, as far as my limitations go of knowing Ruby C APIs go. I didn't check that all the Check_Type calls are right, or that UINT2NUM values are in range or anything like that.

Is there anything in particular you want me to look at?

morrisonlevi · 2022-03-01T15:24:04Z

ext/ddtrace_profiling_native_extension/http_transport.c

+
+static VALUE create_exporter(struct ddprof_ffi_EndpointV3 endpoint, VALUE tags_as_array) {
+  long tags_count = rb_array_len(tags_as_array);
+  ddprof_ffi_Tag converted_tags[tags_count];


Does Ruby have macros/functions for optionally using alloca and falling back to malloc depending on the size and if the platform supports it?

PHP has these macros, and it's useful for cases like this where it should be small enough to put on the stack but in case someone passes a large array you are protected against blowing your stack.

That's... a great point, thanks! I dug around the documentation and the VM headers and there are a few things there (undocumented, of course!) but nothing that seemed a really good fit.

So in 65cbe6d I just changed this to a heap allocation. Shrug emoji :)

morrisonlevi · 2022-03-01T15:39:33Z

ext/ddtrace_profiling_native_extension/http_transport.c

+  ddprof_ffi_SendResult result;
+};
+
+inline static ddprof_ffi_ByteSlice byte_slice_from_ruby_string(VALUE string);


FYI, you can probably drop inline here; in this case it wouldn't do anything different than just static. The only exception I'm aware is when you mix inline static and plain inline (no static) functions, then sometimes the compiler will yell at you. It's not important, just a bit of a C semantic you may or may not know about :)

TIL! I expect at some point this will move to a shared file, and I'll drop the static part, so I'm thinking of leaving it for now more as a declaration of intention than expecting different semantics.

Nice link discussing this as well -> https://stackoverflow.com/questions/7762731/whats-the-difference-between-static-and-static-inline-function .

morrisonlevi · 2022-03-01T16:12:53Z

ext/ddtrace_profiling_native_extension/http_transport.c

+    Check_Type(tag_name, T_STRING);
+    Check_Type(tag_value, T_STRING);


What happens if Check_Type fails? I see it "raises" an exception but how is that technically implemented?

It's what you might've suspected: our friends setjmp and longjmp :)

Control will jump to a Ruby VM-controlled exception handler.

ivoanjo · 2022-03-02T10:56:52Z

@morrisonlevi:

Looks good to me, as far as my limitations go of knowing Ruby C APIs go. I didn't check that all the Check_Type calls are right, or that UINT2NUM values are in range or anything like that.

Thanks for the great inputs so far!

The Ruby VM functions usually are safe -- they internally check invariants and throw exceptions if things aren't as intended.
Sometimes there's an unsafe variant as well, for extra performance, but in this case I've avoided them, since we're not doing heavy stuff, and so the extra checks are very welcome.

Is there anything in particular you want me to look at?

Yes -- if I'm using libddprof in a sane way, especially all the resource cleanup related comments. Am I releasing the things that should be released, and correctly ignoring the things that libddprof takes ownership of -- including in the failure cases?

luhenry · 2022-05-10T09:23:31Z

ext/ddtrace_profiling_native_extension/http_transport.c

+
+inline static ddprof_ffi_ByteSlice byte_slice_from_ruby_string(VALUE string) {
+  Check_Type(string, T_STRING);
+  ddprof_ffi_ByteSlice byte_slice = {.ptr = (uint8_t *) StringValuePtr(string), .len = RSTRING_LEN(string)};


How is the memory pointed by the pointer returned by StringValuePtr managed? Does it point to a GCed memory area, or is it a raw pointer? If a raw pointer, how do we make sure it doesn't leak (especially regarding the use of rb_raise)?

It's a raw pointer to the actual char * that Ruby strings keep. So it has the same lifetime as the Ruby object, and we don't need to clean it up, since the Ruby GC will manage its lifetime (when the object gets collected, the string will get free'd).

On the other hand, we do have to be careful around concurrency, e.g. we must never manipulate Ruby strings if we have released the Global VM Lock, because Ruby may decide to move/change/collect them concurrently.

luhenry · 2022-05-10T09:24:37Z

ext/ddtrace_profiling_native_extension/http_transport.c

+  struct call_exporter_without_gvl_arguments args = {.exporter = exporter, .request = request};
+  // TODO: We don't provide a function to interrupt reporting, which means this thread will be blocked until
+  // call_exporter_without_gvl returns.
+  rb_thread_call_without_gvl(call_exporter_without_gvl, &args, NULL, NULL);


What happens to the memory pointed by StringValuePtr when the GVL isn't held? Can a GC run and move things around?

Yes, it can clean it up (if we don't keep a GC-visible reference to it), move it, etc.

This is why we can't release the GVL until after we build the libddprof request. Only afterwards, when libddprof has copied stuff to its own internal representation can we release the GVL.

luhenry · 2022-05-10T09:27:19Z

ext/ddtrace_profiling_native_extension/http_transport.c

+  VALUE code_provenance_data,
+  VALUE tags_as_array
+) {
+  ddprof_ffi_NewProfileExporterV3Result exporter_result = create_exporter(exporter_configuration, tags_as_array);


Do we need to build a new exporter because we can't yet set tags after the creation of the exporter?

That was one of the reasons I did it (see 5d12ae6 ) although that is no longer the case. Quoting from that commit description:

Reusing the libddprof exporter has a number of disadvantages:

We need to have quite a bit of boilerplate to be able to represent
them as Ruby objects

Tags are set at exporter creation, so we would need to somehow
tackle the issue of tags such as runtime-id and pid that can change
during the lifetime of a process

(⬆️ This has been fixed by Levi in recent versions of libddprof)

There were some concerns around libddprof and Ruby apps that fork,
although we didn't look into them in much detail

After benchmarking both approaches using the
benchmarks/profiler_http_transport.rb benchmark, the performance
of an HttpTransporter that does not reuse the exporter is more
than acceptable. I measured > 1000 reports per second on my laptop;
and the expected usage for this class is one report per minute, so
I think we're good.

Thus, let's refactor HttpTransport to create a new exporter
every time we report:

This allows the tags to change from report to report (and thus
they move back to be part of the Flush, as they were in the
pre-HttpTransport times)

We eliminate the need for the HttpTransport::Exporter class that
wrapped the native libddprof pointer.

No need to care about potential issues with forks.

I'd say the main advantage is the fork-safety -- Ruby apps can and do fork, at any time, so we could always leave an exporter in a weird internal state, so it seems simpler and less error-prone to not reuse it.

ivoanjo · 2022-05-16T11:03:10Z

I had previously created #1936 which included a bunch of extra fixes that hadn't made it to this PR. While revisiting these PRs, I don't think having #1936 separate from this PR makes sense anymore (it accumulated too many useful fixes), so I've added the commits from it to this PR, and closed that PR.

ivoanjo · 2022-05-17T14:54:01Z

I've rebased this PR on top of current master, as well as added a few fixes to make sure it's in a shippable state :)

(The extraction originally came from `http_transport.c`, added in #1923 but after a rebase only the addition of the new header remained)

…dprof One of the features provided by `libddprof` is reporting profiles. The `HttpTransport` class exposes this functionality to Ruby code, and will in the future replace most of the code in `lib/datadog/profiling/transport/. This commit and PR only adds the `HttpTransport` without wiring it up; that will come in a separate PR.

The number of tags is variable, and we expect not that big, but the failure sematics of stack allocation are completely bonkers so let's just switch this to a regular heap allocation.

…ise error In #1932 we did the groundwork required to support parsing the unix domain socket configuration from a c.tracing.transport_options configuration. Thus, we can now support unix domain sockets configured by the user without any other change (assuming that PR is merged before this commit is). The `transport_configuration_proc` can still contain weird transport configurations, but if that's the case, we just ignore them and print a warning. This is because we don't support every setting a user may provide in that proc. Worst case, for customers with a really custom config, is that the profiler won't be able to report data, and the customer will need to reach out to support for help updating it. (But we won't raise an exception or stop the profiler from starting).

Code provenance collection can be disabled, and in that case should not be reported. This was already a use case supported by the previous transport code, but it was missed when `HttpTransport` got created.

See DataDog/libddprof#33 for details.

This benchmark can be used to evaluate the performance and memory correctness of the `HttpTransport` class. In particular, it was used to fix a memory leak (and to validate that we don't have others), and was used to validate the change where `HttpTransport` creates a new exporter for every report, rather than reusing exporters.

Reusing the libddprof exporter has a number of disadvantages: * We need to have quite a bit of boilerplate to be able to represent them as Ruby objects * Tags are set at exporter creation, so we would need to somehow tackle the issue of tags such as runtime-id and pid that can change during the lifetime of a process * There were some concerns around libddprof and Ruby apps that fork, although we didn't look into them in much detail After benchmarking both approaches using the `benchmarks/profiler_http_transport.rb` benchmark, the performance of an `HttpTransporter` that does not reuse the exporter is more than acceptable. I measured > 1000 reports per second on my laptop; and the expected usage for this class is one report per minute, so I think we're good. Thus, let's refactor `HttpTransport` to create a new exporter every time we report: * This allows the `tags` to change from report to report (and thus they move back to be part of the `Flush`, as they were in the pre-`HttpTransport` times) * We eliminate the need for the `HttpTransport::Exporter` class that wrapped the native libddprof pointer. * No need to care about potential issues with forks.

…gs array

Because we are using `rb_thread_call_without_gvl2`, we need to be careful not to assume that our code ran at all after it returns.

Previously we were sending all webrick logs to a `StringIO` that was never checked, which meant that if something was wrong inside webrick, we would not know about it (since webrick runs on a background thread). Instead, I've configured webrick to log to stderr, and raised the log level so only important issues get printed.

The really long timeout may make a failing test hang, whereas with a shorter timeout libddprof will just fail the request and the test will finish with an error.

There is a bug in webrick in Ruby 2.2 when used with `DoNotListen: true` which broke in our tests. I've added a workaround for it.

Whenever something is dynamically allocated, we must ensure that either no exceptions can be thrown or that we clean it up before throwing an exception.

This is becoming a common-enough operation that it makes sense to have a nice wrapper for it.

…ddprof objects The previous version of the code went: 1. Create libddprof exporter 2. Convert Ruby VALUE objects into libddprof equivalents 3. Fire off the request In the happy path, this worked fine. But if in step 2 any exception was raised (because some object was `nil` or of an unexpected type), then the exporter would be leaked. To fix this, I've flipped the code around so that the order now goes: 1. Convert Ruby VALUE objects into libddprof equivalents 2. Create libddprof exporter 3. Fire off the request Thus, we now do all of the work that may raise exceptions up front, and only then do we start calling libddprof APIs which allocate dynamic memory. (I did slightly move the code around to minimize the diff, and that's why `build_request` actually became `_native_do_export`, and the old code in `_native_do_export` became the new `perform_export`).

Make sure we clean up `tags` if logging failed with an exception.

It took me a bit to wrap my head around this one, but `rb_thread_call_without_gvl2` can return with an "interrupt". This interrupt may mean that the VM has some work to do (and wants us to call something like `rb_thread_check_ints()`), BUT that SHOULD NOT be a reason for the export to fail, as this is expected to happen from time to time. I initially missed this subtlety, and it manifested itself as flaky tests in CI (this seems to happen more often in Ruby 2.3 and below). Tests would randomly fail with the `Interrupted before call_exporter_without_gvl ran` message we previously had. After some experimentation, I found I could reproduce on any Ruby version, if I just did enough calls to export: ```ruby context 'when server returns a 5xx failure' do let(:server_proc) { proc { |_req, res| res.status = 503 } } it 'logs an error' do expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/) 10_000.times do |i| print '@' if i % 100 == 0 http_transport.export(flush) end end end ``` The fix here is to separate two different cases: * There's an interrupt and the VM wants to process it. This shouldn't fail the export -- we just let the VM do whatever and retry our call. * There's an exception to be raised. This will fail the export.

ivoanjo · 2022-05-25T11:26:35Z

I had held back on marking this PR as non-draft as I was seeing a few flaky tests related to releasing the GVL.

I was able to track them down and fix them in 4cd898e, so this is ready to go!

(I also rebased it on top of current master)

morrisonlevi · 2022-05-25T20:39:26Z

ext/ddtrace_profiling_native_extension/http_transport.c

+    // We're in a weird situation that libddprof doesn't quite support. The ddprof_ffi_Request payload is dynamically
+    // allocated and needs to be freed, but libddprof doesn't have an API for dropping a request because it's kinda
+    // weird that a profiler builds a request and then says "oh, nevermind" and throws it away.
+    // We could add such an API, but I'm somewhat hesitant since this is a really bizarre corner case that looks quite


I'll included it in the next version. I was just noticing independently we don't have such a thing. How does ddprof_ffi_Request_drop sound?

Uh, that's nice! Yeah ddprof_ffi_Request_drop works great for me, thanks!

ivoanjo · 2022-05-26T08:52:40Z

Thanks y'all! Merging ahead! CI is red due to a lint issue that has been fixed in master, but I don't think it's worth chasing it down by merging master into this branch or rebasing just because rubocop wants an extra blank line somewhere.

…of failing After fixing the VM interruptions issue for `HttpTransport` in 4cd898e (#1923), It looks to me that the same issue could happen to the `StackRecorder`. I tried to reproduce the issue in a similar way as we did for the `HttpTransport` but just running `serialize` in a loop did not trigger it. On a hunch, I decided to hack it into the `HttpTransport` specs (since `HttpTransport` seemed to be able to cause the VM to have pending interrupts) and I was able to see it fail, very rarely on Ruby 2.7: ```diff @@ -419,9 +419,13 @@ RSpec.describe Datadog::Profiling::HttpTransport do let(:server_proc) { proc { |_req, res| res.status = 503 } } it 'logs an error' do - expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/) + #expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/) - http_transport.export(flush) + 10_000.times do |i| + print '@' if i % 100 == 0 + Datadog::Profiling::StackRecorder.new.serialize + http_transport.export(flush) + end end end ``` TL;DR `rb_thread_call_without_gvl2` can return BEFORE running our code when the VM has a pending interruption, so we need to let the VM do its work, but there's no reason to fail the action (in this case, fail to serialize), if all the VM wants is to run some callback. So instead we retry running our code until it either runs or the VM raises an exception. The approach here is simpler than in `HttpTransport` because the heap allocations happen only inside `call_serialize_without_gvl`. In `HttpTransport` the code is more awkward because there are heap allocations done before calling `rb_thread_call_without_gvl2` and those need to be cleaned up as well.

…of failing After fixing the VM interruptions issue for `HttpTransport` in 4cd898e (#1923), I suspected that the same issue could happen to the `StackRecorder`. It took a few attempts to reproduce this issue, but the easiest way seems to be to "send ctrl+c to itself" by adding ```c #include <signal.h> kill(getpid(), SIGINT); ``` before the call to `rb_thread_call_without_gvl2`. This made the VM flag an interrupt, which as long as no exception was raised (and RSpec has its own interrupt handler) simulated the issue. TL;DR `rb_thread_call_without_gvl2` can return BEFORE running our code when the VM has a pending interruption, so we need to let the VM do its work, but there's no reason to fail the action (in this case, fail to serialize) if all the VM wants is to run some callback and then we can get on with our work. So instead we retry running our code until it either runs or the VM actually raises an exception. The approach here is simpler than in `HttpTransport` because the heap allocations happen only inside `call_serialize_without_gvl`; in `HttpTransport` there are heap allocations done before calling `rb_thread_call_without_gvl2` and thus a more complex approach is needed.

…tpTransport` class Everything in `lib/datadog/profiling/transport/` already existed prior to the migration to the libddprof-based `HttpTransport` class (#1923). Temporarily, and just in case there are unknown-unknowns with `HttpTransport`, I've re-imported a few classes, and modified them so they will be able to be used **instead of** `Datadog::Profiling::HttpTransport`. This is only temporary, and will be deleted/reverted later.

morrisonlevi reviewed Mar 1, 2022

View reviewed changes

This was referenced Mar 3, 2022

[PROF-4780] Report profiles using libddprof, drop old codepath #1924

Closed

[PROF-4780] WIP: libddprof integration improvements and tweaks #1936

Closed

ivoanjo force-pushed the ivoanjo/prof-4779-enable-libddprof branch from 8ae9ea1 to da925a0 Compare April 25, 2022 14:40

luhenry reviewed May 10, 2022

View reviewed changes

luhenry approved these changes May 10, 2022

View reviewed changes

ivoanjo force-pushed the ivoanjo/prof-4779-enable-libddprof branch from da925a0 to be6fb26 Compare May 16, 2022 09:48

ivoanjo force-pushed the ivoanjo/prof-4780-add-libddprof-httptransport branch from c85d4e3 to 59637a9 Compare May 16, 2022 10:51

Base automatically changed from ivoanjo/prof-4779-enable-libddprof to master May 17, 2022 08:41

ivoanjo force-pushed the ivoanjo/prof-4780-add-libddprof-httptransport branch from 00280c5 to e10a0e9 Compare May 17, 2022 11:04

ivoanjo changed the title ~~[PROF-4780] Add HttpTransport class to report profiles through libddprof~~ [PROF-4780] Add libbdprof gem as dependency + Add HttpTransport class to report profiles through libddprof May 17, 2022

ivoanjo changed the title ~~[PROF-4780] Add libbdprof gem as dependency + Add HttpTransport class to report profiles through libddprof~~ [PROF-4780] Add libbdprof gem as dependency + Add HttpTransport class to report profiles through libddprof May 17, 2022

ivoanjo mentioned this pull request May 18, 2022

[PROF-4780] Add libddprof dependency to power the new Ruby profiler #2028

Merged

ivoanjo force-pushed the ivoanjo/prof-4780-add-libddprof-httptransport branch from df58eb4 to 81ff330 Compare May 19, 2022 09:17

ivoanjo changed the title ~~[PROF-4780] Add libbdprof gem as dependency + Add HttpTransport class to report profiles through libddprof~~ [PROF-4780] Add HttpTransport class to report profiles through libddprof May 19, 2022

ivoanjo added a commit that referenced this pull request May 19, 2022

Create libddprof_helpers.h and extract char_slice_from_ruby_string

3ca061f

(The extraction originally came from `http_transport.c`, added in #1923 but after a rebase only the addition of the new header remained)

ivoanjo added 8 commits May 25, 2022 12:16

Introduce new Datadog::Profiling::Flush class

aa589b1

Allocate memory for tags array on the heap, instead of in the stack

3dbf07e

The number of tags is variable, and we expect not that big, but the failure sematics of stack allocation are completely bonkers so let's just switch this to a regular heap allocation.

Support missing code provenance in HttpTransport

fdb9e77

Code provenance collection can be disabled, and in that case should not be reported. This was already a use case supported by the previous transport code, but it was missed when `HttpTransport` got created.

Update to use new ddprof File struct

bb32d17

See DataDog/libddprof#33 for details.

ivoanjo added 13 commits May 25, 2022 12:18

Fix potential memory leak by validating endpoint before allocating ta…

7174b8c

…gs array

Safely handle interruption of thread before send gets called

5b5320b

Because we are using `rb_thread_call_without_gvl2`, we need to be careful not to assume that our code ran at all after it returns.

Fix compilation on older Rubies

13de57e

Reduce upload_timeout_seconds in spec

d9b675c

The really long timeout may make a failing test hang, whereas with a shorter timeout libddprof will just fail the request and the test will finish with an error.

Fix broken test for unix socket reporting on Ruby 2.2

6dc0a82

There is a bug in webrick in Ruby 2.2 when used with `DoNotListen: true` which broke in our tests. I've added a workaround for it.

Rubocop fixes

a26d33a

Use char_slice_from_ruby_string from helpers

c817038

Fix potential memory leaks in error code paths for HttpTransport

85700f2

Whenever something is dynamically allocated, we must ensure that either no exceptions can be thrown or that we clean it up before throwing an exception.

Extract helper for creating Ruby string from libddprof Vec_u8

02f1041

This is becoming a common-enough operation that it makes sense to have a nice wrapper for it.

Fix potential tags memory leak in case of exception

cbb2a69

Make sure we clean up `tags` if logging failed with an exception.

ivoanjo force-pushed the ivoanjo/prof-4780-add-libddprof-httptransport branch from 81ff330 to 4cd898e Compare May 25, 2022 11:20

ivoanjo marked this pull request as ready for review May 25, 2022 11:26

ivoanjo requested a review from a team May 25, 2022 11:26

marcotc approved these changes May 25, 2022

View reviewed changes

morrisonlevi reviewed May 25, 2022

View reviewed changes

Update comment after PR discussion

ca3c08c

ivoanjo merged commit 2a1e7a3 into master May 26, 2022

ivoanjo deleted the ivoanjo/prof-4780-add-libddprof-httptransport branch May 26, 2022 08:52

github-actions bot added this to the 1.2.0 milestone May 26, 2022

ivoanjo mentioned this pull request May 26, 2022

Handle VM interruptions during StackRecorder serialization instead of failing #2048

Merged

ivoanjo mentioned this pull request May 31, 2022

[PROF-4780] Report profiles using libddprof by default #2059

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PROF-4780] Add HttpTransport class to report profiles through libddprof #1923

[PROF-4780] Add HttpTransport class to report profiles through libddprof #1923

ivoanjo commented Mar 1, 2022 •

edited

Loading

morrisonlevi left a comment

morrisonlevi Mar 1, 2022

ivoanjo Mar 2, 2022

morrisonlevi Mar 1, 2022

ivoanjo Mar 2, 2022

morrisonlevi Mar 1, 2022

ivoanjo Mar 2, 2022

ivoanjo commented Mar 2, 2022

luhenry May 10, 2022

ivoanjo May 10, 2022

luhenry May 10, 2022 •

edited

Loading

ivoanjo May 10, 2022

luhenry May 10, 2022

ivoanjo May 10, 2022

ivoanjo commented May 16, 2022

ivoanjo commented May 17, 2022

ivoanjo commented May 25, 2022

morrisonlevi May 25, 2022 •

edited

Loading

ivoanjo May 26, 2022

ivoanjo commented May 26, 2022

		Check_Type(tag_name, T_STRING);
		Check_Type(tag_value, T_STRING);

[PROF-4780] Add HttpTransport class to report profiles through libddprof #1923

[PROF-4780] Add HttpTransport class to report profiles through libddprof #1923

Conversation

ivoanjo commented Mar 1, 2022 • edited Loading

morrisonlevi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ivoanjo commented Mar 2, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

luhenry May 10, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ivoanjo commented May 16, 2022

ivoanjo commented May 17, 2022

ivoanjo commented May 25, 2022

morrisonlevi May 25, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ivoanjo commented May 26, 2022

ivoanjo commented Mar 1, 2022 •

edited

Loading

luhenry May 10, 2022 •

edited

Loading

morrisonlevi May 25, 2022 •

edited

Loading