Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-4780] Add HttpTransport class to report profiles through libddprof #1923

Merged
merged 27 commits into from
May 26, 2022

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Mar 1, 2022

One of the features provided by libddprof is an http/https client for reporting profiles.

The HttpTransport class exposes this functionality to Ruby code, and will in the future replace most of the code in lib/datadog/profiling/transport/.

This PR only adds the HttpTransport class without wiring it up; that will come in a separate PR.

Because the HttpTransport class is not wired up, this is not expected to have any observable change to customers (and thus no impact).

Copy link

@morrisonlevi morrisonlevi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, as far as my limitations go of knowing Ruby C APIs go. I didn't check that all the Check_Type calls are right, or that UINT2NUM values are in range or anything like that.

Is there anything in particular you want me to look at?


static VALUE create_exporter(struct ddprof_ffi_EndpointV3 endpoint, VALUE tags_as_array) {
long tags_count = rb_array_len(tags_as_array);
ddprof_ffi_Tag converted_tags[tags_count];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Ruby have macros/functions for optionally using alloca and falling back to malloc depending on the size and if the platform supports it?

PHP has these macros, and it's useful for cases like this where it should be small enough to put on the stack but in case someone passes a large array you are protected against blowing your stack.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's... a great point, thanks! I dug around the documentation and the VM headers and there are a few things there (undocumented, of course!) but nothing that seemed a really good fit.

So in 65cbe6d I just changed this to a heap allocation. Shrug emoji :)

ddprof_ffi_SendResult result;
};

inline static ddprof_ffi_ByteSlice byte_slice_from_ruby_string(VALUE string);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, you can probably drop inline here; in this case it wouldn't do anything different than just static. The only exception I'm aware is when you mix inline static and plain inline (no static) functions, then sometimes the compiler will yell at you. It's not important, just a bit of a C semantic you may or may not know about :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL! I expect at some point this will move to a shared file, and I'll drop the static part, so I'm thinking of leaving it for now more as a declaration of intention than expecting different semantics.

Nice link discussing this as well -> https://stackoverflow.com/questions/7762731/whats-the-difference-between-static-and-static-inline-function .

Comment on lines 144 to 168
Check_Type(tag_name, T_STRING);
Check_Type(tag_value, T_STRING);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if Check_Type fails? I see it "raises" an exception but how is that technically implemented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's what you might've suspected: our friends setjmp and longjmp :)

Control will jump to a Ruby VM-controlled exception handler.

@ivoanjo
Copy link
Member Author

ivoanjo commented Mar 2, 2022

@morrisonlevi:

Looks good to me, as far as my limitations go of knowing Ruby C APIs go. I didn't check that all the Check_Type calls are right, or that UINT2NUM values are in range or anything like that.

Thanks for the great inputs so far!

The Ruby VM functions usually are safe -- they internally check invariants and throw exceptions if things aren't as intended.
Sometimes there's an unsafe variant as well, for extra performance, but in this case I've avoided them, since we're not doing heavy stuff, and so the extra checks are very welcome.

Is there anything in particular you want me to look at?

Yes -- if I'm using libddprof in a sane way, especially all the resource cleanup related comments. Am I releasing the things that should be released, and correctly ignoring the things that libddprof takes ownership of -- including in the failure cases?


inline static ddprof_ffi_ByteSlice byte_slice_from_ruby_string(VALUE string) {
Check_Type(string, T_STRING);
ddprof_ffi_ByteSlice byte_slice = {.ptr = (uint8_t *) StringValuePtr(string), .len = RSTRING_LEN(string)};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the memory pointed by the pointer returned by StringValuePtr managed? Does it point to a GCed memory area, or is it a raw pointer? If a raw pointer, how do we make sure it doesn't leak (especially regarding the use of rb_raise)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a raw pointer to the actual char * that Ruby strings keep. So it has the same lifetime as the Ruby object, and we don't need to clean it up, since the Ruby GC will manage its lifetime (when the object gets collected, the string will get free'd).

On the other hand, we do have to be careful around concurrency, e.g. we must never manipulate Ruby strings if we have released the Global VM Lock, because Ruby may decide to move/change/collect them concurrently.

struct call_exporter_without_gvl_arguments args = {.exporter = exporter, .request = request};
// TODO: We don't provide a function to interrupt reporting, which means this thread will be blocked until
// call_exporter_without_gvl returns.
rb_thread_call_without_gvl(call_exporter_without_gvl, &args, NULL, NULL);
Copy link

@luhenry luhenry May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to the memory pointed by StringValuePtr when the GVL isn't held? Can a GC run and move things around?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can clean it up (if we don't keep a GC-visible reference to it), move it, etc.

This is why we can't release the GVL until after we build the libddprof request. Only afterwards, when libddprof has copied stuff to its own internal representation can we release the GVL.

VALUE code_provenance_data,
VALUE tags_as_array
) {
ddprof_ffi_NewProfileExporterV3Result exporter_result = create_exporter(exporter_configuration, tags_as_array);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to build a new exporter because we can't yet set tags after the creation of the exporter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was one of the reasons I did it (see 5d12ae6 ) although that is no longer the case. Quoting from that commit description:

Reusing the libddprof exporter has a number of disadvantages:

  • We need to have quite a bit of boilerplate to be able to represent
    them as Ruby objects

  • Tags are set at exporter creation, so we would need to somehow
    tackle the issue of tags such as runtime-id and pid that can change
    during the lifetime of a process

(⬆️ This has been fixed by Levi in recent versions of libddprof)

  • There were some concerns around libddprof and Ruby apps that fork,
    although we didn't look into them in much detail

After benchmarking both approaches using the
benchmarks/profiler_http_transport.rb benchmark, the performance
of an HttpTransporter that does not reuse the exporter is more
than acceptable. I measured > 1000 reports per second on my laptop;
and the expected usage for this class is one report per minute, so
I think we're good.

Thus, let's refactor HttpTransport to create a new exporter
every time we report:

  • This allows the tags to change from report to report (and thus
    they move back to be part of the Flush, as they were in the
    pre-HttpTransport times)

  • We eliminate the need for the HttpTransport::Exporter class that
    wrapped the native libddprof pointer.

  • No need to care about potential issues with forks.

I'd say the main advantage is the fork-safety -- Ruby apps can and do fork, at any time, so we could always leave an exporter in a weird internal state, so it seems simpler and less error-prone to not reuse it.

@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4779-enable-libddprof branch from da925a0 to be6fb26 Compare May 16, 2022 09:48
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4780-add-libddprof-httptransport branch from c85d4e3 to 59637a9 Compare May 16, 2022 10:51
@ivoanjo
Copy link
Member Author

ivoanjo commented May 16, 2022

I had previously created #1936 which included a bunch of extra fixes that hadn't made it to this PR. While revisiting these PRs, I don't think having #1936 separate from this PR makes sense anymore (it accumulated too many useful fixes), so I've added the commits from it to this PR, and closed that PR.

Base automatically changed from ivoanjo/prof-4779-enable-libddprof to master May 17, 2022 08:41
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4780-add-libddprof-httptransport branch from 00280c5 to e10a0e9 Compare May 17, 2022 11:04
@ivoanjo ivoanjo changed the title [PROF-4780] Add HttpTransport class to report profiles through libddprof [PROF-4780] Add libbdprof gem as dependency + Add HttpTransport class to report profiles through libddprof May 17, 2022
@ivoanjo ivoanjo changed the title [PROF-4780] Add libbdprof gem as dependency + Add HttpTransport class to report profiles through libddprof [PROF-4780] Add libbdprof gem as dependency + Add HttpTransport class to report profiles through libddprof May 17, 2022
@ivoanjo
Copy link
Member Author

ivoanjo commented May 17, 2022

I've rebased this PR on top of current master, as well as added a few fixes to make sure it's in a shippable state :)

@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4780-add-libddprof-httptransport branch from df58eb4 to 81ff330 Compare May 19, 2022 09:17
@ivoanjo ivoanjo changed the title [PROF-4780] Add libbdprof gem as dependency + Add HttpTransport class to report profiles through libddprof [PROF-4780] Add HttpTransport class to report profiles through libddprof May 19, 2022
ivoanjo added a commit that referenced this pull request May 19, 2022
(The extraction originally came from `http_transport.c`, added in
 #1923 but after a rebase only the addition of the new header remained)
…dprof

One of the features provided by `libddprof` is reporting profiles.

The `HttpTransport` class exposes this functionality to Ruby code, and
will in the future replace most of the code in
`lib/datadog/profiling/transport/.

This commit and PR only adds the `HttpTransport` without wiring it up;
that will come in a separate PR.
The number of tags is variable, and we expect not that big, but the
failure sematics of stack allocation are completely bonkers so let's
just switch this to a regular heap allocation.
…ise error

In #1932 we did the
groundwork required to support parsing the unix domain socket
configuration from a c.tracing.transport_options configuration.

Thus, we can now support unix domain sockets configured by the user
without any other change (assuming that PR is merged before this
commit is).

The `transport_configuration_proc` can still contain weird transport
configurations, but if that's the case, we just ignore them and
print a warning. This is because we don't support every setting
a user may provide in that proc.

Worst case, for customers with a really custom config, is that
the profiler won't be able to report data, and the customer will need
to reach out to support for help updating it. (But we won't raise
an exception or stop the profiler from starting).
Code provenance collection can be disabled, and in that case should
not be reported.

This was already a use case supported by the previous transport code,
but it was missed when `HttpTransport` got created.
This benchmark can be used to evaluate the performance and memory
correctness of the `HttpTransport` class.

In particular, it was used to fix a memory leak (and to validate
that we don't have others), and was used to validate the change
where `HttpTransport` creates a new exporter for every report,
rather than reusing exporters.
Reusing the libddprof exporter has a number of disadvantages:

* We need to have quite a bit of boilerplate to be able to represent
  them as Ruby objects

* Tags are set at exporter creation, so we would need to somehow
  tackle the issue of tags such as runtime-id and pid that can change
  during the lifetime of a process

* There were some concerns around libddprof and Ruby apps that fork,
  although we didn't look into them in much detail

After benchmarking both approaches using the
`benchmarks/profiler_http_transport.rb` benchmark, the performance
of an `HttpTransporter` that does not reuse the exporter is more
than acceptable. I measured > 1000 reports per second on my laptop;
and the expected usage for this class is one report per minute, so
I think we're good.

Thus, let's refactor `HttpTransport` to create a new exporter
every time we report:

* This allows the `tags` to change from report to report (and thus
  they move back to be part of the `Flush`, as they were in the
  pre-`HttpTransport` times)

* We eliminate the need for the `HttpTransport::Exporter` class that
  wrapped the native libddprof pointer.

* No need to care about potential issues with forks.
ivoanjo added 13 commits May 25, 2022 12:18
Because we are using `rb_thread_call_without_gvl2`, we need to be
careful not to assume that our code ran at all after it returns.
Previously we were sending all webrick logs to a `StringIO` that was
never checked, which meant that if something was wrong inside webrick,
we would not know about it (since webrick runs on a background thread).

Instead, I've configured webrick to log to stderr, and raised the log
level so only important issues get printed.
The really long timeout may make a failing test hang, whereas with
a shorter timeout libddprof will just fail the request and the test
will finish with an error.
There is a bug in webrick in Ruby 2.2 when used with
`DoNotListen: true` which broke in our tests.

I've added a workaround for it.
Whenever something is dynamically allocated, we must ensure that either
no exceptions can be thrown or that we clean it up before throwing
an exception.
This is becoming a common-enough operation that it makes sense to
have a nice wrapper for it.
…ddprof objects

The previous version of the code went:

1. Create libddprof exporter
2. Convert Ruby VALUE objects into libddprof equivalents
3. Fire off the request

In the happy path, this worked fine. But if in step 2 any exception
was raised (because some object was `nil` or of an unexpected type),
then the exporter would be leaked.

To fix this, I've flipped the code around so that the order now goes:

1. Convert Ruby VALUE objects into libddprof equivalents
2. Create libddprof exporter
3. Fire off the request

Thus, we now do all of the work that may raise exceptions up front,
and only then do we start calling libddprof APIs which allocate dynamic
memory.

(I did slightly move the code around to minimize the diff, and that's
why `build_request` actually became `_native_do_export`, and the old
code in `_native_do_export` became the new `perform_export`).
Make sure we clean up `tags` if logging failed with an exception.
It took me a bit to wrap my head around this one, but
`rb_thread_call_without_gvl2` can return with an "interrupt".

This interrupt may mean that the VM has some work to do (and wants us
to call something like `rb_thread_check_ints()`), BUT that SHOULD NOT be
a reason for the export to fail, as this is expected to happen from
time to time.

I initially missed this subtlety, and it manifested itself as flaky
tests in CI (this seems to happen more often in Ruby 2.3 and below).
Tests would randomly fail with the
`Interrupted before call_exporter_without_gvl ran` message we
previously had.

After some experimentation, I found I could reproduce on any Ruby
version, if I just did enough calls to export:

```ruby
    context 'when server returns a 5xx failure' do
      let(:server_proc) { proc { |_req, res| res.status = 503 } }

      it 'logs an error' do
        expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/)

        10_000.times do |i|
          print '@' if i % 100 == 0
          http_transport.export(flush)
        end
      end
    end
```

The fix here is to separate two different cases:
* There's an interrupt and the VM wants to process it. This shouldn't
  fail the export -- we just let the VM do whatever and retry our
  call.

* There's an exception to be raised. This will fail the export.
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-4780-add-libddprof-httptransport branch from 81ff330 to 4cd898e Compare May 25, 2022 11:20
@ivoanjo
Copy link
Member Author

ivoanjo commented May 25, 2022

I had held back on marking this PR as non-draft as I was seeing a few flaky tests related to releasing the GVL.

I was able to track them down and fix them in 4cd898e, so this is ready to go!

(I also rebased it on top of current master)

@ivoanjo ivoanjo marked this pull request as ready for review May 25, 2022 11:26
@ivoanjo ivoanjo requested a review from a team May 25, 2022 11:26
// We're in a weird situation that libddprof doesn't quite support. The ddprof_ffi_Request payload is dynamically
// allocated and needs to be freed, but libddprof doesn't have an API for dropping a request because it's kinda
// weird that a profiler builds a request and then says "oh, nevermind" and throws it away.
// We could add such an API, but I'm somewhat hesitant since this is a really bizarre corner case that looks quite
Copy link

@morrisonlevi morrisonlevi May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll included it in the next version. I was just noticing independently we don't have such a thing. How does ddprof_ffi_Request_drop sound?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, that's nice! Yeah ddprof_ffi_Request_drop works great for me, thanks!

@ivoanjo
Copy link
Member Author

ivoanjo commented May 26, 2022

Thanks y'all! Merging ahead! CI is red due to a lint issue that has been fixed in master, but I don't think it's worth chasing it down by merging master into this branch or rebasing just because rubocop wants an extra blank line somewhere.

@ivoanjo ivoanjo merged commit 2a1e7a3 into master May 26, 2022
@ivoanjo ivoanjo deleted the ivoanjo/prof-4780-add-libddprof-httptransport branch May 26, 2022 08:52
@github-actions github-actions bot added this to the 1.2.0 milestone May 26, 2022
ivoanjo added a commit that referenced this pull request May 26, 2022
…of failing

After fixing the VM interruptions issue for `HttpTransport` in
4cd898e (#1923), It looks to me
that the same issue could happen to the `StackRecorder`.

I tried to reproduce the issue in a similar way as we did for the
`HttpTransport` but just running `serialize` in a loop did not
trigger it.

On a hunch, I decided to hack it into the `HttpTransport` specs
(since `HttpTransport` seemed to be able to cause the VM to
have pending interrupts) and I was able to see it fail, very
rarely on Ruby 2.7:

```diff
@@ -419,9 +419,13 @@ RSpec.describe Datadog::Profiling::HttpTransport do
       let(:server_proc) { proc { |_req, res| res.status = 503 } }

       it 'logs an error' do
-        expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/)
+        #expect(Datadog.logger).to receive(:error).with(/unexpected HTTP 503/)

-        http_transport.export(flush)
+        10_000.times do |i|
+          print '@' if i % 100 == 0
+          Datadog::Profiling::StackRecorder.new.serialize
+          http_transport.export(flush)
+        end
       end
     end
```

TL;DR `rb_thread_call_without_gvl2` can return BEFORE running our
code when the VM has a pending interruption, so we need to let the VM
do its work, but there's no reason to fail the action (in this case,
fail to serialize), if all the VM wants is to run some callback.

So instead we retry running our code until it either runs or the VM
raises an exception.

The approach here is simpler than in `HttpTransport` because the
heap allocations happen only inside `call_serialize_without_gvl`.

In `HttpTransport` the code is more awkward because there are
heap allocations done before calling `rb_thread_call_without_gvl2`
and those need to be cleaned up as well.
ivoanjo added a commit that referenced this pull request May 26, 2022
…of failing

After fixing the VM interruptions issue for `HttpTransport` in
4cd898e (#1923), I suspected
that the same issue could happen to the `StackRecorder`.

It took a few attempts to reproduce this issue, but the
easiest way seems to be to "send ctrl+c to itself" by adding

```c
  #include <signal.h>

  kill(getpid(), SIGINT);
```

before the call to `rb_thread_call_without_gvl2`.

This made the VM flag an interrupt, which as long as no
exception was raised (and RSpec has its own interrupt handler)
simulated the issue.

TL;DR `rb_thread_call_without_gvl2` can return BEFORE running our
code when the VM has a pending interruption, so we need to let the VM
do its work, but there's no reason to fail the action (in this case,
fail to serialize) if all the VM wants is to run some callback
and then we can get on with our work.

So instead we retry running our code until it either runs or the VM
actually raises an exception.

The approach here is simpler than in `HttpTransport` because the
heap allocations happen only inside `call_serialize_without_gvl`;
in `HttpTransport` there are heap allocations done before calling
`rb_thread_call_without_gvl2` and thus a more complex approach is
needed.
ivoanjo added a commit that referenced this pull request May 31, 2022
…tpTransport` class

Everything in `lib/datadog/profiling/transport/` already existed prior
to the migration to the libddprof-based `HttpTransport` class
(#1923).

Temporarily, and just in case there are unknown-unknowns with
`HttpTransport`, I've re-imported a few classes, and modified them
so they will be able to be used **instead of**
`Datadog::Profiling::HttpTransport`.

This is only temporary, and will be deleted/reverted later.
ivoanjo added a commit that referenced this pull request May 31, 2022
…tpTransport` class

Everything in `lib/datadog/profiling/transport/` already existed prior
to the migration to the libddprof-based `HttpTransport` class
(#1923).

Temporarily, and just in case there are unknown-unknowns with
`HttpTransport`, I've re-imported a few classes, and modified them
so they will be able to be used **instead of**
`Datadog::Profiling::HttpTransport`.

This is only temporary, and will be deleted/reverted later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants