
[PROF-9263] Add experimental support for profiling code hotspots when used with opentelemetry ruby gem #3510

Draft · wants to merge 7 commits into master
Conversation

@ivoanjo (Member) commented on Mar 6, 2024

What does this PR do?

This PR adds experimental support for getting profiling code hotspots data (including endpoint profiling) when profiling processes being traced using the opentelemetry ruby gem directly.

Note that this differs from the recommended way of using opentelemetry with the ddtrace library, which is to follow the instructions from https://docs.datadoghq.com/tracing/trace_collection/custom_instrumentation/otel_instrumentation/ruby/ .

The key difference is that this PR makes code hotspots work even for setups that opt not to use require 'datadog/opentelemetry' (which is the recommended and easier way); that recommended setup is sketched below for reference.
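The recommended path boils down to loading ddtrace's OpenTelemetry interop, which routes opentelemetry API calls through the Datadog tracer (see the docs linked above for the full instructions):

# Recommended setup: with this require in place, code hotspots work out of the
# box. This PR targets apps that skip it and use the opentelemetry gem directly.
require 'datadog/opentelemetry'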

The approach taken here is similar to #2342 and #3466: we peek inside the implementation of the opentelemetry gem to extract the information we need (namely the span id, local root span id, trace type, and trace endpoint). This approach is potentially brittle, which is why the code is written very defensively, with the aim of never breaking the application (or profiling) if something is off -- it just won't collect code hotspots.
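To make the "defensive" style concrete, here's a minimal Ruby sketch of the idea (illustrative only; the method name is made up, and the PR itself reads these values from the profiler's native sampling code rather than from Ruby):

# Illustrative sketch of defensively peeking at opentelemetry span data.
def otel_span_identifiers
  span = OpenTelemetry::Trace.current_span
  span_context = span && span.context
  # With no active span, current_span returns an invalid placeholder; bail out quietly.
  return nil unless span_context && span_context.valid?

  { span_id: span_context.hex_span_id, trace_id: span_context.hex_trace_id }
rescue StandardError
  nil # if the gem's internals change, degrade to "no code hotspots", never crash
end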

Motivation:

We have a customer interested in running this setup, so hopefully they'll be able to test this PR and validate whether it works for them.

Furthermore, I'm hoping to see if the opentelemetry Ruby folks would be open to tweaking their APIs to be friendlier to tools such as the profiler, but for now I opted for getting our hands dirty.

Additional Notes:

I'm opening this PR as draft until we can get feedback from the customer and see if this works for them.

How to test the change?

On top of the added test coverage, I was able to see code hotspots working for the following sinatra example app:

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'

  gem 'rackup'
  gem 'dogstatsd-ruby'
  gem 'ddtrace', git: 'https://github.com/datadog/dd-trace-rb', branch: 'ivoanjo/prof-9263-otlp-ruby-code-hotspots'
  gem 'sinatra'
  gem 'opentelemetry-api'
  gem 'opentelemetry-sdk'
  gem 'opentelemetry-instrumentation-sinatra'
  gem 'opentelemetry-exporter-otlp'
  gem 'pry'
end

require 'sinatra/base'
require 'opentelemetry/sdk'
require 'pry'

Datadog.configure do |c|
  c.service = 'ivoanjo-testing-opentelemetry-test'
  c.profiling.enabled = true
end

# Configure OpenTelemetry
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'ivoanjo-testing-opentelemetry-test'
  c.use 'OpenTelemetry::Instrumentation::Sinatra'
end

class MyApp < Sinatra::Base
  get '/' do
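    # Tagging the span with this process's runtime id lets the backend
    # associate this trace with the profiles uploaded by this process.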
    OpenTelemetry::Trace.current_span.add_attributes({'runtime-id' => Datadog::Core::Environment::Identity.id})
    sleep 1
    'Hello, OpenTelemetry!'
  end
end

MyApp.run!

After doing a few requests, here's how this looks:

[screenshots: code hotspots data for the example app shown in the profiler UI]

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

…tel gem

Things missing:

* Specs conflict with ddtrace otel specs (need to poke at appraisals)
* Missing endpoint support: while we don't need the actual span object to read the span ids, we will need it to read the endpoint names.

… specs

I'm... unhappy about this, but couldn't think of anything better that wouldn't involve refactoring the ddtrace tracing otel support, and that seems even worse.
@ivoanjo requested review from a team as code owners on March 6, 2024 15:06
@ivoanjo marked this pull request as draft on March 6, 2024 15:06
@github-actions bot added the profiling (Involves Datadog profiling) label on Mar 6, 2024
@AlexJF (Contributor) left a comment:

LGTM! And so safe you should be designing baby furniture 😄

@@ -734,6 +765,11 @@ static void trigger_sample_for_thread(
struct trace_identifiers trace_identifiers_result = {.valid = false, .trace_endpoint = Qnil};
trace_identifiers_for(state, thread, &trace_identifiers_result);

if (!trace_identifiers_result.valid) {
@AlexJF (Contributor) commented on Mar 6, 2024:

Worth/possible doing a bit of extra work at the start to arrive at a sticky decision here, and short-circuit the constantly-failing trace_identifiers_for with all its rb_ivar_get calls, if ddtrace is not used for tracing at all?

Or is the thinking that we want to support situations where there's a mix of ddtrace and pure-ot traces and/or the ability to change between one and the other dynamically (e.g. via a feature flag)?

@ivoanjo (Member, Author) replied:

I think your suggestion makes sense.

My intent here in checking both is that the profiler may start quite early in the app lifecycle, so we may not know which one is going to be used yet.

> Or is the thinking that we want to support situations where there's a mix of ddtrace and pure-ot traces and/or the ability to change between one and the other dynamically (e.g. via a feature flag)?

I'm not sure mixing is even possible at this point, since the ddtrace otel support monkey patches itself pretty deep into opentelemetry (which is why I needed to contort a bit to be able to test both).

For that reason, and after our last discussion, I think it makes sense to stop checking opentelemetry once we see data coming from ddtrace traces.

The reverse is harder to figure out, actually. It would be weird, but not impossible, for an app that started with opentelemetry to then switch over to ddtrace.


TL;DR: I'll wait for feedback from our customer on how this is working before acting on this comment, just in case we end up going in a completely different direction BUT I'll definitely come back to it before marking the PR as non-draft.
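For illustration, here's a hypothetical sketch of the sticky-decision idea from this thread (the class and helper names are made up, and the real check would live in the profiler's native sampling code):

# Hypothetical sketch: once ddtrace trace data is observed, stop probing
# opentelemetry internals on every sample.
class TraceSourceDetector
  def initialize
    @seen_ddtrace = false
  end

  # ddtrace_identifiers_for and otel_identifiers_for are stand-ins for the
  # per-tracer lookups discussed above.
  def identifiers_for(thread)
    result = ddtrace_identifiers_for(thread)
    if result
      @seen_ddtrace = true # sticky: ddtrace monkey patches otel, so mixing is unlikely
      return result
    end
    return nil if @seen_ddtrace # short-circuit the otel introspection

    otel_identifiers_for(thread)
  end
end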

Labels: profiling (Involves Datadog profiling)
2 participants