
Allow the OpenTelemetry tracer to be used with the profiler #3819

Closed
phillipuniverse opened this issue Jun 13, 2022 · 9 comments

phillipuniverse commented Jun 13, 2022

Which version of dd-trace-py are you using?

1.1.3

Which version of pip are you using?

22.0.3

Which version of the libraries are you using?

OpenTelemetry 1.12.0rc1

How can we reproduce your problem?

I use OpenTelemetry for all span/trace instrumentation. I would like to use the traces/spans from OpenTelemetry to give context to the ddtrace Profiler. Here's an example of what I would like to do:

from ddtrace.profiling import Profiler
from opentelemetry import trace

otel_tracer = trace.get_tracer(__name__)

Profiler(
    env="prod", service="myservice", version="abc123", url=f"http://datadog-host.internal", tracer=otel_tracer
)

Alternatively, is there perhaps a bridge that lets the default Datadog tracer from ddtrace.tracer simply inherit the trace/span information from OpenTelemetry?

shahargl (Contributor) commented Aug 3, 2022

@phillipuniverse Did you find any way to work around this? (We are looking to do exactly the same thing.)

phillipuniverse (Author) commented

@shahargl Not yet, but I hope to make some investments in this area within the next month!

phillipuniverse (Author) commented

Potentially relevant is #4170 where the DD tracer now has knowledge of the OpenTelemetry traceparent headers.

phillipuniverse (Author) commented

Looks like #4170 was reverted; the base issue is at #3872.

I went down a pretty long and involved process trying to build a bridge between the OpenTelemetry tracer and the Datadog tracer. I couldn't quite get the stack profiles to work though and I think in general it's just too shaky. I'm going to try a different approach where I do the following:

  1. Take some of the work in feat: support w3c trace context #4170 (that unfortunately got reverted) and use it to start a Datadog tracer via propagation alongside the otel tracer (a rough sketch follows this list)
  2. Disable any span/trace exports that the Datadog tracer might generate since they are already captured by OpenTelemetry
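
For reference, a minimal sketch of what step 1 could look like. It assumes a ddtrace build where the w3c tracecontext extraction from #4170 is available to HTTPPropagator (which may also need the propagation style to be configured), and the helper names here are my own, not ddtrace APIs:

from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator
from opentelemetry import trace

def otel_traceparent_headers() -> dict:
    # Render the currently active otel span as a w3c traceparent header.
    ctx = trace.get_current_span().get_span_context()
    return {"traceparent": f"00-{ctx.trace_id:032x}-{ctx.span_id:016x}-{ctx.trace_flags:02x}"}

def activate_datadog_context() -> None:
    # Feed the otel-derived traceparent into the Datadog propagator so that any spans
    # the ddtrace instrumentation creates continue the same trace.
    dd_context = HTTPPropagator.extract(otel_traceparent_headers())
    tracer.context_provider.activate(dd_context)

Step 2 (suppressing the export of the resulting Datadog spans) is what the trace-writer workaround further down is about.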

This has other downsides though. I would like to avoid the Datadog instrumentation so I don't have to worry about disabling span exports, but if I don't use the instrumentation I will have to manually establish the tracer in the Django/FastAPI/pika/Celery/etc. entrypoints in my services. I need to think more about this.

At any rate, here is a gist of my sort-of working code. Profiles do show up but are not linked to spans, so it doesn't truly work right.

phillipuniverse (Author) commented

Below is a lot of research that I did into this, which has all been dead ends so far. At this point I would need some additional advice from Datadog folks on what exactly needs to be set where in the internals of the Datadog tracer to get things synced correctly between DD and otel. I'm going to pause work on this until I get additional feedback.

Summary of the issues below:

  • The profiler at some point does see trace id/span ids
  • The profiler seems to create a bunch of extra stack trace events and keeps around the last span id for a period of time. The Datadog span is almost certainly not being "finished" correctly
  • The bridge between the Otel span and the DD span is currently designed as "point in time" rather than continuous. Attributes need to be synced constantly between the otel span and the dd span to detect whether the span is finished or not

I think the final conclusion of this is probably that the bridge will work like the existing opentracing implementation. @Kyle-Verhoog it looks like you were one of the initial drivers there; do you have some pointers for important pieces of that bridge that could apply to OpenTelemetry support? Specifically, what needs to be synced on the Datadog side?

A couple of other thoughts:

  • Is the context api the best spot to monkeypatch from OpenTelemetry?
  • OpenTelemetry has a nice SpanProcessor abstraction. This is mainly used by the exporters, but maybe this is the right place to "start" and "end" the Datadog spans? (A rough sketch of this idea follows this list.)
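
To make that SpanProcessor bullet concrete, here is a rough sketch of the idea. The class name is made up, and forcing the ddtrace span ids via attribute assignment is an assumption about internals, not a supported API:

from ddtrace import tracer as dd_tracer
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider

class DatadogBridgeSpanProcessor(SpanProcessor):
    """Start/finish a shadow Datadog span for every otel span so the profiler can see it."""

    def __init__(self):
        self._dd_spans = {}  # otel span_id -> ddtrace span

    def on_start(self, span, parent_context=None):
        dd_span = dd_tracer.trace(span.name)
        otel_ctx = span.context
        # Assumption: these ids are settable after the span starts. Datadog ids are
        # 64-bit, so only the low 64 bits of the otel trace id are kept.
        dd_span.trace_id = otel_ctx.trace_id & ((1 << 64) - 1)
        dd_span.span_id = otel_ctx.span_id
        self._dd_spans[otel_ctx.span_id] = dd_span

    def on_end(self, span):
        dd_span = self._dd_spans.pop(span.context.span_id, None)
        if dd_span is not None:
            dd_span.finish()

provider = TracerProvider()
provider.add_span_processor(DatadogBridgeSpanProcessor())
trace.set_tracer_provider(provider)

Whether the profiler would actually link samples to those shadow spans still depends on the local root bookkeeping discussed below.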

I have an update, but it's not really good news. I started going down the path of using the Datadog instrumentation and disabling export. I found a good workaround in this issue for how to disable the trace writer, which works great!

But it's still not quite giving me what I want, because I still have to sync the Datadog Span/Trace with the ids of the otel Span/Trace. Everything has to line up for profiler events to be associated in the APM. I couldn't find a great way to do that; maybe something like monkeypatching the Span constructor to hardcode the otel span/trace ids? But then you have another complication in ensuring that the Datadog tracer is always established after the otel one, and it wouldn't work in places where you might be creating a manual span via otel. I think the only viable solution, then, is to have the Datadog side hook into the otel side like I thought originally.

So back to my gist. I think I have a better bridge between otel and dd through the OpenTelemetry context API. That is the low-level function that truly activates/deactivates a span. My goal is to take the otel context API and use it to sync changes into the DD context API.
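
The kind of hook I mean is something like this sketch (the wrapper and the 64-bit truncation are rough assumptions; Context and context_provider.activate are the documented ddtrace pieces I'm leaning on):

import opentelemetry.context as otel_context
from ddtrace import tracer
from ddtrace.context import Context
from opentelemetry import trace

_original_attach = otel_context.attach

def _attach_and_sync(context):
    # Let otel activate its context first, then mirror the newly active span into ddtrace.
    token = _original_attach(context)
    span_ctx = trace.get_current_span(context).get_span_context()
    if span_ctx.is_valid:
        tracer.context_provider.activate(
            Context(
                trace_id=span_ctx.trace_id & ((1 << 64) - 1),  # truncate the 128-bit otel trace id
                span_id=span_ctx.span_id,
            )
        )
    return token

otel_context.attach = _attach_and_sync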

This works slightly better. In the gist I include a monkeypatch to the DD Recorder, which is responsible for exporting profiles (a rough sketch of it follows the log output below). I have some simple prints to figure out what the current trace and span id are. When I hit an endpoint in my Django app, I do see this printing out the span id. But you'll see that the _local_root is None, which based on my research below seems to be required. I am obviously not setting that in my bridge that converts an otel span into a Datadog span:

backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 3478091631850939025 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 3478091631850939025 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 3478091631850939025 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 3478091631850939025 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 14404735425225168374 and local root None
backend-core-core-1  | Recording event StackSampleEvent with span 3478091631850939025 and local root None
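
(For reference, the logging monkeypatch producing those lines is roughly the following; Recorder.push_event and the event attribute names are my guesses at the profiling internals, so treat it as a sketch.)

from ddtrace.profiling.recorder import Recorder

_original_push_event = Recorder.push_event

def _push_event_with_logging(self, event):
    # Assumption: stack sample events carry span_id / local_root_span_id attributes.
    span_id = getattr(event, "span_id", None)
    local_root = getattr(event, "local_root_span_id", None)
    print(f"Recording event {type(event).__name__} with span {span_id} and local root {local_root}")
    return _original_push_event(self, event)

Recorder.push_event = _push_event_with_logging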

Another weird thing is that it constantly ping-pongs between span ids well after the request is over, like it's continuing to get profile events for a previous span even after that span is done. My guess is that I need to be stopping/detaching something, but I don't know where.

Then I went and tried to validate what happens without OpenTelemetry, just manually patching. So I included this:

from ddtrace import config, patch

config.env = ENVIRONMENT  # the environment the application is in
config.service = SERVICE  # name of your application
config.version = VERSION  # version of your application
patch(django=True)

And voila, I got span ids and they were all hooked up into the Datadog APM correctly. I also only see this logging when I would expect it, in the context of an API request, not persistently like before.

backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137
backend-core-core-1  | Recording event StackSampleEvent with span 15632444396742362186 and local root 7301043703004759137

brettlangdon (Member) commented

@phillipuniverse sorry for the long delay responding here. Would you be willing to send me an email at brett.langdon@datadoghq.com? I'd love to try and schedule a call with you to discuss this topic!

emmettbutler (Collaborator) commented

I'm fairly sure that this issue was fixed by #5140, which was merged recently. @mabdinur please reopen this issue if I'm mistaken.

emmettbutler (Collaborator) commented

Whoops! The new OpenTelemetry integration does not yet link profiles with otel spans.

emmettbutler (Collaborator) commented

Manually doing what the GitHub Action was supposed to do (fixed in #7835):

This issue has been automatically closed after six months of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.

emmettbutler closed this as not planned on Dec 4, 2023