Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry Trace forwarding to subscribers is not working correctly #12752

Open
singhbaljit opened this issue Mar 21, 2024 · 5 comments
Open
Assignees

Comments

@singhbaljit
Copy link

singhbaljit commented Mar 21, 2024

What happened?

I've OpenTelemetry traces enabled, along side Trace All Messages. Following instructions from the docs:

opentelemetry {
  exporter { endpoint = "http://localhost:4317" }
  traces {
   enable = true
   filter.trace_all = true
 }
}
  • When publishing a message with user property traceparent, EQMX sends the same value to the subscribers. This is incorrect. The W3C traceparent includes a trace and span id (<version>-<traceid>-<spanid>-<flags>). The span id in the published message belongs to the publisher. EMQX should overwrite this value with its own span. Namely, the span in the message received by the subscriber should be the send_published_message span. This allows downstream processes to use the new span as its parent to continue building the distributed trace chain.
  • When publishing a message without user property traceparent and filter.trace_all = true, EQMX sends no traceparent to the subscriber.

What did you expect to happen?

EQMX should send an updated traceparent to the subscribers, where the span id is updated for each subscriber of the message. Based on the documentation, it seems like send_published_message is what it should be. This should work even if the publisher sends no traceparent (given that trace all messages is enabled).

How can we reproduce it (as minimally and precisely as possible)?

No response

Anything else we need to know?

No response

EMQX version

5.5.1

OS version

EQMX on Kubernetes

Log files

@singhbaljit singhbaljit changed the title OpenTelemetry Traces forwarding to subscribers is not working currently OpenTelemetry Traces forwarding to subscribers is not working correctly Mar 21, 2024
@singhbaljit singhbaljit changed the title OpenTelemetry Traces forwarding to subscribers is not working correctly OpenTelemetry Trace forwarding to subscribers is not working correctly Mar 21, 2024
@SergeTupchiy SergeTupchiy self-assigned this Mar 21, 2024
@SergeTupchiy
Copy link
Contributor

SergeTupchiy commented Mar 21, 2024

Hi @singhbaljit.

This is an intended behavior that is based on MQTT spec and OpenTelemetry semantic conventions for messaging systems:

  1. If EMQX receives trace context in a published message, e.g., traceparent/tracestate User-property for MQTT v5.0, it must be sent unaltered when forwarding the Application Message to a Client to conform with MQTT specification 3.3.2.3.7.
  2. https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#context-propagation :

A message may traverse many different components and layers in one or more intermediaries when it is propagated from the producer to the consumer(s). To be able to correlate consumer traces with producer traces using the existing context propagation mechanisms, all components must propagate context down the chain.

Messaging systems themselves may trace messages as the messages travels from producers to consumers. Such tracing would cover the transport layer but would not help in correlating producers with consumers. To be able to directly correlate producers with consumers, another context that is propagated with the message is required.

A message creation context allows correlating producers with consumers of a message and model the dependencies between them, regardless of the underlying messaging transport mechanism and its instrumentation.

The message creation context is created by the producer and should be propagated to the consumer(s). Consumer traces cannot be directly correlated with producer traces if the message creation context is not attached and propagated with the message.

A producer SHOULD attach a message creation context to each message. If possible, the message creation context SHOULD be attached in such a way that it cannot be changed by intermediaries.

@SergeTupchiy
Copy link
Contributor

In other words, traces emitted by EMQX provide extra level of details and cover intermediary level (messaging system itself).
One can disable OpenTelemetry traces in EMQX and EMQX will still perfectly participate in distributed tracing by simply propagating trace context from publishers to subscribers (as described above).

@singhbaljit
Copy link
Author

singhbaljit commented Mar 21, 2024

@SergeTupchiy thank you for responding promptly, and thank you for referencing the documentation. It was wrong for me to call it "incorrect". EMQX is following the MQTT v5 spec correctly: that is, the user properties are not mutated. However, I do still think this is functionally incorrect. Here's effectively what the trace map looks like:
eqmx2

@singhbaljit
Copy link
Author

singhbaljit commented Mar 21, 2024

I also don't think the OpenTelemetry document is saying to forward traceparent unmutated. It is only saying that the context must be propagated, which means trace id must be unmutated; the span id could/should change. This is exactly what happens in Spring Boot Kafka (Micrometer).

I do think the MQTT v5 spec and the W3C context propagation for MQTT are in contradiction. I can raise this issue there as well. But do you think it possible to add another configuration to enable overriding the traceparent?

@SergeTupchiy
Copy link
Contributor

Messaging systems themselves may trace messages as the messages travels from producers to consumers. Such tracing would cover the transport layer but would not help in correlating producers with consumers.
A producer SHOULD attach a message creation context to each message. If possible, the message creation context SHOULD be attached in such a way that it cannot be changed by intermediaries.

Doesn't it say that any tracing done by messaging system itself is internal and should not affect producer/consumer correlation?
Mutating trace id breaks any correlation and makes any distributed tracing impossible. This is quite obvious, so I don't think that this is the only point of the Open Telemetry document statements mentioned above.

Yes, it is technically feasible to override traceparent in User Properties sent out from EMQX to subscribers (so that subscribers can link to EMQX send_published_message spans) and make this behaviour configurable. It may be considered for development as a new feature (if not rejected), but it's definitely not a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants