Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry API #2332

Merged
merged 82 commits into from
Nov 19, 2023
Merged

OpenTelemetry API #2332

merged 82 commits into from
Nov 19, 2023

Conversation

PROFeNoM
Copy link
Contributor

@PROFeNoM PROFeNoM commented Oct 27, 2023

Description

Trace API

  • Span Events and Span Links are missing

Metrics

  • Not supported

Log (Appenders)

  • Logs correlation is properly done when using Monolog log appender. Still, please review HandlerTest.php to confirm behavior with 128-bit trace identifiers (i.e., trace id: hex, span id: dec)

Context

  • Spans created using DD's API are only converted to OTel spans when the OTel API is used and a context update is triggered (e.g., Context::getCurrent). Otherwise, they aren't converted (Reason: Creating 1 OTel span for each DD Span could become expensive for massive traces, considering OTel's interface carries many instances).
  • Creating a span using DD's API assumes that its scope becomes activated, and activates ancestors from the oldest non-activated ancestor to the newly created span
  • Converted spans use a custom InstrumentationScope (named datadog) and default resource information

Autoinstrumentation

  • OTel & DD Auto instrumentation worked concurrently as far as my testing could tell. (TODO: Add some CI tests)
  • Note to self: Look at web.request spans existence

Trace SDK

  • Since span events and span links are missing, span events/links limits are missing
  • Default span processors and exporters are supported as far as my testing could tell (e.g., Jaeger, see below)
    image

Profiler


Misc:

  • Datadog sources (src/ directory) will now be copied to the artifacts. It is necessary for the autoloading to be done when installing with datadog-setup.php (the problem doesn't exist with .apk or .deb)

Readiness checklist

  • (only for Members) Changelog has been added to the release document.
  • Tests added for this feature/bug.

Reviewer checklist

  • Appropriate labels assigned.
  • Milestone is set.
  • Changelog has been added to the release document. For community contributors the reviewer is in charge of this task.

PROFeNoM and others added 30 commits July 25, 2023 15:15
…lemetry-api

# Conflicts:
#	ext/configuration.h
…g format

This will be required for the display of OpenTelemetry array attributes.
Note that this also removes the parenthesis from boolean/null values in order to be consistent with other languages, i.e. system-tests.

Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
# Conflicts:
#	ext/ddtrace_arginfo.h
#	ext/serializer.c
#	src/Integrations/Integrations/PHPRedis/PHPRedisIntegration.php
#	tests/ext/sandbox/safe_to_string_metadata.phpt
@bwoebi bwoebi added this to the 0.94.0 milestone Nov 10, 2023
Comment on lines +78 to +82
if (str.s && ZSTR_VAL(str.s)[ZSTR_LEN(str.s) - 1] == ',') {
ZSTR_VAL(str.s)[ZSTR_LEN(str.s) - 1] = '\0';
ZSTR_LEN(str.s)--;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: This is really caused by bad handling of the trailing comma in distributed tracing parsing and has been fixed ... on the span links branch.


public function hasEnded(): bool
{
return $this->hasEnded;
Copy link
Collaborator

@bwoebi bwoebi Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentionally just checking for hasEnded and not getDuration() != 0, which is only triggered when Context::current is called?

start_span();
$otelSpan = Context::current();
close_span();
var_dump($otelSpan->hasEnded()); // false?

Copy link
Contributor Author

@PROFeNoM PROFeNoM Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lovely counter-example 👍 Yes, this was intentional - I didn't want to use getDuration and rely on an instance variable, although this is an O(1) (only because I found it to be prettier). But, as you showed, it only works if the context is updated.

PS: I guess $this->endEpochNanos is also wrong (when calling $span->toSpanData()) in this example because of this

Copy link
Contributor Author

@PROFeNoM PROFeNoM Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking about toSpanData, that just made me realized that... the attributes may be wrong if you set/unset attributes from the DD-span...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially if you unset

$this->context->getTraceState()
);

$this->spanProcessor->onEnd($this);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this:

  • Given that endOTelSpan is only called on Context::current(), if there's no further access to Context::current() after terminating the span, the spanProcessor->onEnd() will not be called
  • this only runs for datadog-only Spans which happen to be accessed at least once by a Context::current(), right?
  • I'm not sure what the best behaviour is, maybe, for consistency exclusively run it on OTel managed spans?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that endOTelSpan is only called on Context::current(), if there's no further access to Context::current() after terminating the span, the spanProcessor->onEnd() will not be called

Yes. One alternative I also tried would have been to do an OpenTelemetryIntegration, which traces a span's termination... but faced limitations. To this day, I still think this would have been the cleanest way overall (i.e., have sort of an event when a DD span is created/closed without the OTel API so that we could simply instrument them and start/end OTel spans as needed).

this only runs for datadog-only Spans, which happen to be accessed at least once by a Context::current(), right?

Yes

I'm not sure what the best behaviour is, maybe, for consistency exclusively run it on OTel managed spans?

For consistency, it may indeed make sense

Copy link
Collaborator

@bwoebi bwoebi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some minor tradeoffs in the code, like Span::endOTelSpan fetching getTraceState() - which incidentally is only called when the next Context::current() call happens.
There might be theoretical divergence, but as long as we acknowledge such edge cases, I think it's fine.

I've left one more point for consideration, but overall looks good I think :-)

Notes:
- A span wasn't finished when calling the span processors, hence the span was considered as not ended and its duration marked to 0
- We can use an instance variable $endEpochNanos to circumvent this issue, but it's a bit uglier than just running the span processors after the (dd) span was closed, considering the span cannot be written onto (ReadableSpanInterface)
- With partial flushing, there may be an issue depending on whether the spans will be readable after flushed (if flushed)
@PROFeNoM
Copy link
Contributor Author

@bwoebi

The changes made on the endEpochNanos/hasEnded 'flags' to simplify the code, although not impacting at all DD's end result, impact Jaeger traces.

I'll be running the span processors after the Datadog span termination. It may conflict with partial flushing in the future. Alternatively, I can just rollback the changes made and use some flags, but it fundamentally conflicts as well.

We can discuss this in our meeting later on today if you care about this.

Before commit:
image

After commit:
image

image

Datadog Trace

@bwoebi
Copy link
Collaborator

bwoebi commented Nov 16, 2023

I think that's fine. Even if we implement partial flushing in future, we probably shouldn't immediately, irreversibly flush the span, but flush it slightly delayed, so that some data may still be set on the closed span, otherwise we risk breaking currently perfectly working code from other users, which nonchalantly add some meta data to the span after closing it.

@PROFeNoM PROFeNoM force-pushed the alex/feature/opentelemetry-api branch from 8fc3add to b073989 Compare November 17, 2023 09:52
@PROFeNoM PROFeNoM merged commit ef0b61a into master Nov 19, 2023
556 of 558 checks passed
@PROFeNoM PROFeNoM deleted the alex/feature/opentelemetry-api branch November 19, 2023 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:opentelemetry 🏆 enhancement A new feature or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants