Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the HTTP header format proposal for TraceContext propagation. #1

Merged
merged 2 commits into from
Sep 6, 2017

Conversation

bogdandrutu
Copy link
Contributor

Add the first spec for the trace-context http header format.

Copy link

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some things noticed, but overall nice job

@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format

Date: 31/03/2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why this is here, though it will be maintenance causing.. wonder if there's a badge to auto-include the git timestamp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


The value will be US-ASCII encoded (which is UTF-8 compliant).

### Version

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version or variant? maybe mention in docs this could be used by variants of the spec? or just wait and see..

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(As a random, interested passerby:)

The problem with having an explicit version field is that you risk breaking all receivers as soon as you increment the field, which means a user can't do a staged rollout of a system that uses a new version of the protocol.

A better strategy would be to reserve some of the options bits and then requiring current implementations to set them to zero and ignore any trailing data. This way, you can evolve your protocol gradually.

#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
e.g., `0x4bf92f3577b34da6a3ce929d0e0e4736`. All bytes 0 is considered invalid.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 0x prefix intended here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

#### Trace-options

Controls tracing options such as sampling, trace level etc. It is a 4-bytes
representing a 32-bit unsigned integer in little-endian order. The least

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java's default is big-endian.. there are a lot of java programmers

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps to elaborate on this, there's a couple things in play. On one hand, if you have only one flag, the endianness isn't an issue. Most users have sampled set as a single bit, and that's the only flag used in zipkin routinely.

On the other hand, the platform default encoding of java as big-endian results in many not being aware of endian-ness in general. Mistakes coding anything binary have been difficult in zipkin, resulting in json actually. While the case endianness is "little" elsewhere, anectodally I've noticed those who have little endian platforms already know how to do big endian. Main thing is that they get annoyed when endian isn't documented. Ex when folks were doing thrift encoding I remember someone rather annoyed that thrift tbinary format was big endian without saying so (because it was implicitly that in java).

food for thought

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java's default is big-endian.. there are a lot of java programmers

not only that, but big-endian is also "network byte order". Wikipedia says:

Big-endian is the most common format in data networking; fields in the protocols of the Internet protocol suite, such as IPv4, IPv6, TCP, and UDP, are transmitted in big-endian order. For this reason, big-endian byte order is also referred to as network byte order.

If I can write the 32bit flags number as 0x00000001, it seems more natural to have 00000001 in the header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that network byte order is big-endian. But probably 99% of the CPU in used are little-endian.

Let's vote here #3 for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @bhs proposal to reduce this to 1 byte is an alternative option. Updated the issue with this possibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. 1-byte used for options.

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x prefix will likely lead to encoding bugs as people will almost certainly copy/paste if we give opportunity to

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736
SpanId = 0x00f067aa0ba902b7
TraceOptions = 0x0 // not-sampled

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make an example that breaks with endian, especially if we use little endian

Copy link

@reta reta Apr 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriancole This is terrific initiative. The proposal is looking quite reasonable. I think the only debatable piece is the presence of Version. Iе would better use your suggestion, like Variant, or Encoding, or Scheme. To give an example, I would like to see something like that: Z<traceid><spanid><options> where Z would tell me that this is Zipkin encoding scheme. For HTrace it could be H<spandId><options> (there is no separate Trace Id, it is encoded in Span Id, at least in 4.1). It is the same 8 bits but using characters instead of 0. Not sure if it makes sense to you guys. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more endianness.

I like the idea of maybe having multiple supported systems, but one of the main purpose is to make all these systems work together so having separate encodings for each system will not be that much of a win.

@wu-sheng
Copy link
Contributor

wu-sheng commented Apr 7, 2017

@adriancole Will this become the TraceContext propagation spec of zipkin? Or do you have wider purpose?

@codefromthecrypt
Copy link

@wu-sheng this is a spec for dapper-like systems. It will take some time for something like this to work its way through implementations and interop, but it is indeed something I'd like to see possible in zipkin tracers. For example, there's some TODO: work left with brave to show an alternative context header, and this format is what I'll use to test that.

@wu-sheng
Copy link
Contributor

wu-sheng commented Apr 7, 2017

@adriancole I will keep watching this. Your trace-related projects are very interesting for me.

Copy link
Contributor

@bhs bhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, @bogdandrutu. I made some comments.

@bogdandrutu what's the vision/scope for the TraceContext github org?

Date: 31/03/2017

A trace context header is used to pass trace context information across systems
for a HTTP request. Our goal is to share this with the community so that various
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear a lot of pain around tracing things like messages buses / kafka / etc. Might be worth intentionally decoupling from HTTP and making this about any messaging tech that has room for metadata/headers.


#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The room in 🇩🇪 seemed to favor a variable-length field here... I thought that was for good reason, too: 128 bits of precision is completely unnecessary for the purposes of most tracing systems, yet insufficient for others we heard about that needed 192. Why does this spec need to try and get this right? (We can set a hard max at 32 bytes or something if you are concerned about unbounded allocations)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw As a unit, zipkin usually implement things a lot later than the first request after significant cross-project interest exists, and there are hands to do it. Ex 128bit trace ids was implemented probably 2 years after the first sites started forking zipkin to accomplish this. This happened after a couple more recent glitches occurred including users of AWS lambda not being able to reuse the uuid there, as well zipkin users who have hybrid sites that employ stackdriver (which is 128bit). This topic has been discussed at length on other forums, so not trying to replicate it here. Suffice to say that 128bit is definitely in use in zipkin sites.

To make things less conjectury, it would be nice to have a matrix of tracing systems and their trace context requirements someplace, maybe on a google doc or otherwise.

In practice, Zipkin has never been asked directly to support a bit length higher than 128. However, we have been asked for more "squishy" things like opaque strings. Also, some tracers like sleuth were asked for lambda, which has an interesting timestamp+identifier trace id format. Having a matrix around can make things more objective as people can see if all of the systems they care about can work with a spec or not. It won't be about most, rather what site is running. For example, 9/10 is meaningless if that last one is the only one they use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we accept variable size context we should definitely have a max size for both trace_id/span_id if 32 bytes fits all the systems then we should probably change to that. Do you know any system that is willing to change to a common format and uses more than 128bits? As @adriancole mentioned for the moment the systems that are interested in using this format are Zipkin and Google and both are fine with 128bits.

One option to move forward is to make v0 having fixed sizes and we can define the v1 later when we have a clear system that wants to use this and requires more than 128bits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also filed an issue to keep track of who wants/is willing to use this format.

Please add your project here #4. Add any extra requirements.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracelytics uses 160-bit trace IDs currently, and we are interested in a common format. FWIW AWS X-Ray uses a 128-bit trace ID format consisting of a 64-bit timestamp concatenated (using a hyphen) to a 96-bit unique ID, both encoded in hex.

*Valid sampled Trace-Context:*

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is over-optimized / pre-optimized for performance. In the context of an RPC, parsing these things is going to be a joke from an overhead standpoint... I'd rather have an id format that's human-readable, or at least human-parseable (without counting characters by hand, that is).

E.g., this:

Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01000000

Copy link

@codefromthecrypt codefromthecrypt Apr 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made comment elsewhere but I also agree a delimited format is near required from an operator POV. Especially as people cross-reference trace ids in other systems. Like splunk search shows something where one system logs the trace id as a logging key. Copy/paste this into zipkin or another system. There's also the "curl"ability which is "can a user work only having curl?" Straightforwardness is important for adoption, or rejection prevention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not unreasonable to add an extra character especially for the HTTP/String version to make this more readable.

If this is an agreement I think we should do it this way. I will start few issues where we can vote for certain things to get all these concerns/issues/questions resolved and CC interested people.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please vote/comment here: #2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Used a delimiter between all fields.

Implementation may decide to completely ignore the trace-context if the span-id
is invalid.

#### Trace-options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's odd to me that we support a versioning bit, yet have this 4-byte thing we only have 1 bit specified for... we could alternatively just have a single byte for version 0 and skip all of the endianness discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting this to fill up (the 4-byte options) with things like what you proposed (sampling probability). Uber also suggested an extra bit for deferred sampling decision.

I am trying to get for the moment the minimum requirement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to collect data about how many bytes we need. So far the list contains 3 (based on Jager requirements).

If the list is less than 8 we can definitely go with 1 byte for the options in v0.

Should we have the sampling probability that @bhs proposed as a separate field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. 1-byte used for options.

@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which systems are committed to support this? Which systems would you like to support this? Might be helpful to @-mention people from the latter so we can have any debates before this is merged.

Copy link

@codefromthecrypt codefromthecrypt Apr 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putting in 2p eventhough the question was for @bogdandrutu :)

TL;DR; I'd expect tracing systems which maintain all of their tracers to make a more top-down decision, but yeah not currently the case in zipkin as this trace context is compatible with zipkin's wire format.

Currently, some zipkin-compatible tracers support vendor-specific or not widely used formats. Those types of tracers will have an easier time with this since zipkin's trace context isn't inherently incompatible with this specification (at the moment). I'd expect the cross-section of google+zipkin users to be first to ask, as it is likely google will land some variant of this first (grpc, cloud services and stackdriver instrumentation).

Similar to other things that happen, when that demand occurs it is usually in a repo or two. For example, our first requests for StackDriver and X-Ray trace support came from sleuth issues list. Tracers run independently and can move to support something sooner or later. I often ping people across tracers on things like this so that they can weigh-in before organic demand hits.

Regardless, support or not support lies in the scope of each tracer to decide until there are server implications like the trace context is incompatible or too wide to store in zipkin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current format (that is compatible with ZIpkin and Google) we try to make at least these systems to work with this format. Having a common place for the specs (maybe some simple implementation in multiple languages) is one of the goal.

Anyone who is interested in using this format is welcome to join the effort and send patches/PRs etc.

@codefromthecrypt
Copy link

@saintstack @reta any feedback on this from htrace pov?


## Header name

`Trace-Context`
Copy link

@costinm costinm May 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Via' is a standard header with pretty similar semantics. It is also in hpack.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trace-Context can be mistaken with something carrying the baggage or context properties. This is id. We decided to use Request-Id for correlation protocol in .NET. If we will be able to converge on format - it may be a good one to use.

Request-Id is quite descriptive for the header purpose I think.

Copy link
Contributor Author

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL adriancole@

@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format

Date: 31/03/2017
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
e.g., `0x4bf92f3577b34da6a3ce929d0e0e4736`. All bytes 0 is considered invalid.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Implementation may decide to completely ignore the trace-context if the span-id
is invalid.

#### Trace-options
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. 1-byte used for options.

#### Trace-options

Controls tracing options such as sampling, trace level etc. It is a 4-bytes
representing a 32-bit unsigned integer in little-endian order. The least
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. 1-byte used for options.

*Valid sampled Trace-Context:*

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Used a delimiter between all fields.

Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736
SpanId = 0x00f067aa0ba902b7
TraceOptions = 0x0 // not-sampled
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more endianness.

I like the idea of maybe having multiple supported systems, but one of the main purpose is to make all these systems work together so having separate encodings for each system will not be that much of a win.

1. Trust and abuse.
2. Bug in caller
3. Different load between caller service and callee service might force callee
to down sample.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming sampling logic may differ much between vendors - why place sampling flag to the trace context? Why not pass it as a separate header that is subject for vendor-specific logic? In some cases to resolve the issues you describe - extension libraries will need more than just a sampling flag. Most probably some additional information from baggage or other headers.

You may also have multi-tier sampling. Take an example of local agent mode that @bogdandrutu demo-ed for census. You may want to sample data that you send to backend with sampling rate 0.01 and locally - 0.1. So you'd need more than one bit to carry both sampling decisions.

Another consideration - properties of Trace-Context belongs to the new span you create. So developer or extension library can access those fields by taking the "running" span details. You can use those identifiers for the non-tracing needs. Sampling flag will not be one of a span properties (will it?). So you have no access to the original value of that flag. And you may not need any options as you may only care about the identifiers. So having options in Trace-Context looks inconsistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SergeyKanzhelev consistent sampling across the whole trace is very important characteristic. If the sampling decision is not propagated and the trace spans multiple implementations t1,...,tN, then each implementation will have to make a sampling decision over and over, and the probability that you will capture the whole trace becomes very small, p1 * ... * pN.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro if you are making a sampling decision based on traceid you will have min(p1, p2, .. pN) number of full traces, not the multiplication. If all probabilities match - all the traces will be collected fully,

When you do sampling you may need to estimate the count of spans exhibit certain properties based on sampled data. In case of statistical sampling you may just multiple the raw count of spans to sampling percentage to get statistically accurate number.

If sampling decision forced from above without the information on sampling percentage of originator - you cannot do this type of estimations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

statistical sampling on every layer is just one example of different type of sampling you may want to implement and you will need more than a bit of informaiton

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are making a sampling decision based on traceid...
When you do sampling you may need to estimate the count of spans

Having the sampling bit does not prevent you from passing extra data or making more elaborate decisions, but it's not part of the proposed standard. NOT having the sampling bit pretty much guarantees that the trace will be broken, unless every implementation makes the decision based on the exact same formula, like traceId < p * 2^128. which is again not a part of the proposed standard.

Sampling bit is a recommendation. If a service can respect it and handle the volume - great. If it cannot respect it 100%, maybe it can respect it for "more important" spans like RPCs, and shed the load by dropping in-process spans & metrics. Essentially it does not put any restrictions on how you want to implement sampling, but still provides a way to achieve consistent sampling across the stack, if the volume allows it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is precisely my question. Do you think it will be typical for libraries to trust this flag? Or every vendor and library will implement own logic so this flag will never be used? Should this standard define the flag or let customers decide on sampling algorithm across services in their org as a decision separate from the data correlation protocol?

Beauty of sampling flag is ability to implement solutions like forced data collection for period of time or specific span name. Also it removes the need to synchronize the sampling decision algorithm.

On negative side - services looses control of the data volume and statistical accuracy of collected data.

Protocol is optimized for a single team owning many components with relatively similar load. It is not always the case. Every component may be owned by a team which want to play nice and contribute to the overall correlation story. But have a bigger priority to fully control telemetry volume and distribution collected from this component.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On negative side - services looses control of the data volume and statistical accuracy of collected data.

I don't agree with this assessment - the flag is recommendation, an implementation does not have to respect it if it thinks it can do a better job.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it will be typical for libraries to trust this flag?

Yes, I think it's a very common implementation to simply trust the flag, in the absence of other knowledge about the system.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm not against the flag that will force all layers to record a specific trace. It is very useful for troubleshooting scenarios.
  2. I also agree that collecting the entire trace as often as possible is very useful.
  3. It is unavoidable that downstream services will collect more telemetry than was sampled in by front door. So there will be incomplete traces.

What I want to avoid is the situation when you heavily rely on implementation detail of upstream component sampling algo. Also I want to make sure there is an easy to replicate on any language mechanism to control the flood of telemetry in case of non-matching load patterns. Third, for many Application Insights scenarios we need to keep the sampling percentage that was used to sample telemetry out so we can run statistical algorithms on telemetry and recognize patterns. So single bit will not generally work for us as a universal sampling mechanism.

I'd propose to have a debug flag with the semantic of occasional forceful tracing across the system. Sampling flag then can be vendor specific.


### Version

Is a 1-byte representing a 8-bit unsigned integer. Version 255 reserved.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rationale here is that we expect the key to remain the same even if the value changes. This way, we can change format in worst case where such a thing is needed, and that can be done without a fan out of administrative activity such as new filter patterns to afford a new trace header key. We should update this to make very clear that version changes are not expected and highly discouraged, especially breaking ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriancole I really like an explicit versioning. I think it will help a lot long term

@bogdandrutu was the idea of the version an incremental thing or the format of the header? Let's say 00 is what explained here, 01 - binary format, 02 - alternative length of request ID or other implementation? We've been thinking suggest to use it to indicate the alternative way of generating identifiers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SergeyKanzhelev cool. I am in favor of versioning at this point, notably to help folks know when to not process a header (ex check magic type of thing likely used similarly in X-Ray's format).

Wrt version also being a format flag, not sure it applies here. For example, in grpc, binary headers are actually encoded as base64, and their header names have -bin appended to them. So in this case, they don't need a different format bit inside their encoded trace data as it is already implicit in the scheme.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if a service is called from device via http with the size concern and from another service with the regular header format? This service need to read two headers? Which one will win?

When you have a single header and allow for binary format - things got easier I believe

@codefromthecrypt
Copy link

codefromthecrypt commented Jun 28, 2017 via email

@codefromthecrypt
Copy link

codefromthecrypt commented Jun 28, 2017 via email

@SergeyKanzhelev
Copy link
Member

SergeyKanzhelev commented Jul 3, 2017

I'm trying to summarize the main difference between this format and other formats:

This format vs. B3:

  • one header vs. multiple headers
  • more RFC-friendly name as prefixes are deprecated
  • versioning support
  • mixing identity and flags in the same header. In B3 you can pass identity without specifying any flags
  • no sharing of span ID between client and server

This format vs. AWS. The best link I found:

  • machine, not human friendly. No names of fields as Amazon header has
  • more RFC-friendly name as prefixes are deprecated
  • not additive as amazon header. Amazon header do not follow RFC as it uses ;, not a comma:

    It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma

  • do not include time apart from Amazon
  • trace-id longer than Amazon's trace ID (it seems that Amazon uses the same length id, but encodes time and version into it), span-id shorter than Amazon's Self-ID

This format vs. .NET proposed:

  • non-hierarchical identifiers. No way to establish sequencing without the whole trace
  • strict requirements on identifiers format. .NET supports strings
  • mixing identity and flags in the same header
  • versioning supported
  • machine, not human friendly. No names of fields

I'm trying to get sense of why to use the new format, not use one of existing. I still think it would make sense to detach sampling/debug flag from the identity information. Support of less strict identifiers can also help integrating with other monitoring systems. I'll post another comment on how the analog of this format can be merged with .NET proposed.

@SergeyKanzhelev
Copy link
Member

Another typical case is Heroku with X-Request-ID (example of another typical case. I even found this document) ). Differences with this proposal:

  • Strict format. Heroku sets no limits on format. Just a base64 string.
  • Allows to hierarchy by sending two identifiers
  • more RFC-friendly name as prefixes are deprecated

@bhs
Copy link
Contributor

bhs commented Jul 5, 2017

@SergeyKanzhelev thank you for the brief survey above.

The more I think about this issue the more I like the idea of leaving the precise bit-width and even the encoding of the trace and span ids out of this spec... why can't they just be strings? (I.e., from a passing-from-process-to-process standpoint, the encoding doesn't matter as long as the (opaque) trace/span ID can be extracted)

@yurishkuro
Copy link
Member

@bhs I think many existing tracing systems won't be able to support trace ID as arbitrary string, but I think this is actually an important question about this spec: is it meant to enforce a single trace ID across different tracing backends, or just provide a correlation between different sub-graphs? At the last Zipkin workshop it was mentioned that blindly respecting a foreign trace ID is dangerous, since a bad player could send the same ID over and over, messing up all of the internal traces for another sub-system. On the other hand, treating it simply as a correlation ID (foreign reference) is safe and does not require any conventions about the bit-length and encoding.

@bhs
Copy link
Contributor

bhs commented Jul 5, 2017

@yurishkuro

I think many existing tracing systems won't be able to support trace ID as arbitrary string

I absolutely agree – my idea here is less that all tracing systems should support an arbitrary string, and more that the tracing system should have control over what a trace ID looks like. I'd be fine restricting it to hex or base36 or whatever if that makes people feel better about it. I just don't see the ROI in trying to enforce a particular bit width, etc. Interop between tracing systems within a single application deployment has always been a non-goal of mine. I'm more interested in allowing the routers and service meshes (and other L7 tech of the world) to pass tracing information along properly.

@codefromthecrypt
Copy link

codefromthecrypt commented Jul 6, 2017 via email

@codefromthecrypt
Copy link

codefromthecrypt commented Jul 6, 2017 via email

@SergeyKanzhelev
Copy link
Member

@adriancole I understand frustration. I still have questions to this format. We also thinking how it can be converged with .NET http correlation protocol and working on proposal. That's why all the questions.

My understanding was that the idea of this proposal was to set a standard for all things tracing, not limit it to Census and Zipkin use case.

@mtwo, @bogdandrutu it will be great if you can summarize the open questions from this thread and plans to merge and evolve the format with Census (I'll also try to reach you directly tomorrow).

@codefromthecrypt
Copy link

codefromthecrypt commented Jul 12, 2017 via email

@bhs
Copy link
Contributor

bhs commented Jul 13, 2017

I likely don't have time to finish this argument, but I do want to re-raise the fact that the benefits of specifying a bit width for ids are in my mind outweighed by the benefits of not-specifying a bit width. The path-id thing is a perfect example... I actually implemented something like that for a few recent demos and expect to see something similar in production before too long; I would love to use a standard inter-process propagation format, but that would be a non-starter with the fixed-width id scheme (at least without some arbitrary baggage map to stick things in instead; though I can understand leaving something that generic out of the spec for now).

I think the current proposal is utterly sane but too narrow. What we'll end up with is yet-another-standard for the inter-process propagation (and maybe even a decent one), but a standard that's too narrow to absorb the semantics of other popular propagation formats and thus one that's limited. I.e., I like it, but I'd like it more if it only placed the restrictions that are literally necessary: since the ids aren't parsed, who cares how wide they are?

@SergeyKanzhelev
Copy link
Member

@bhs I was hoping for the same. I like some ideas in this format though. Like format number prefix and clean dash-delimited string.

In the most distributed systems I see one header for always changing unique string and one for something you need to keep along the entire trace. Most of the time first is being optimized to have a prefix of the <trace-id> so you do not need a separate header for a single value. It will be great to have something like this. Thinking out loud:

Request-Id: <version>-<trace-id>-<whatever you want>
Baggage(/Context?):<key>=<value>,<key>=<value>,<key>=<value>

I've seen older systems that do not do <trace-id> and rely on guaranteed delivery. However now everybody have <trace-id> so it may be part of a standard till artificial intelligence will come up with something more advanced for distributed tracing :-). I think it's safe to agree on it's length either.

This way as I mentioned in #4 you can force all "dumb http services" to simply trace the content of a header with the specific name alongside the logs or pass it forward if it is a proxy of sort.

@codefromthecrypt
Copy link

codefromthecrypt commented Jul 13, 2017 via email

@mtwo
Copy link
Contributor

mtwo commented Jul 13, 2017

I agree with Adrian. Two of the goals of this effort benefit from a common format encoding:

  1. Instrumentation libraries can be more easily shared between tracing vendors. Without a common context format, any kind of vendor-neutral instrumentation library would have to provide a way for vendors to inject their own context propagation scheme, which introduces a fair amount of complexity.

  2. PaaS providers can match up customer traces to their own. For example, if I'm managing a hosted database service and a customer is complaining about poor performance, and I'm propagating customer trace IDs and generating traces on them through my service, then I can debug a customer's request if I'm given a specific trace ID. This has particular benefits to Microsoft, Google, Amazon, and other platform providers, but it also benefits any tracing system that can uses the same format as their customers can provide trace IDs to their provider for debugging purposes.

@SergeyKanzhelev
Copy link
Member

SergeyKanzhelev commented Jul 13, 2017

@adriancole @mtwo the important thing is semantic. One header should be unique for every request and another is constant for a trace. If you have a string which semantic is trace-id you pass it forward and use for correlation. If you have a string which semantic is request-id - generate a new one for outbound connection. For baggage collection - parse if you know how. If not - pass forward. So every vendor can respect semantic and use the values of the headers even without understanding it.

Furthermore - if you recognize the format of the header you can apply some additional logic on it. So having similar instrumentation and format will allow for an extra features. Like ordering or such.

In Application Insights we are using strings to store ID for this specific reason. We have no control over existing instrumentation and allow to use pre-existing correlation practices alongside the Application Insights SDK.

@yurishkuro
Copy link
Member

I tend to agree with @SergeyKanzhelev from a philosophical point of view, even though it would be a breaking change in Jaeger, which currently, similar to Zipkin or StackDriver, only works with fixed-width IDs. There is no strong reason for those IDs to be fixed-width, aside from minor optimization for sampling where the ID can be treated as a random integer and one can do sampled := id < MAXINT * samplingProbability (but not in all languages), and potentially some storage restrictions. Semantically, an opaque variable length string would work just as well for traceID an spanID, and would actually allow for higher degree of interoperability across different implementations, which is important for a standard.

@SergeyKanzhelev
Copy link
Member

@yurishkuro do you use id or trace-id for score generation? I think trace-id that often is used for sampling score generation is less a controversy. It may not be a huge problem to make it a 128bit number in specific format, not just a string in this spec.

In Application Insights we currently using EJB hash to calculate sampling score (.NET, JavaScript). If we will know that it is most probably just a 128bit number - it can make life easier for the score generation - this formulae can be used sampled := trace-id < MAXINT * samplingProbability for such cases.

@bhs
Copy link
Contributor

bhs commented Jul 13, 2017

@mtwo I want to be clear that I'm not arguing against a common trace format. It has obvious advantages, and you've listed two great ones.

My concern is that this standard will not take off in practice because it is not semantically general enough to represent the trace context for certain tracing systems.

I've spoken at length about this with the Envoy team (cc @mattklein123) and with a variety of organizations (large and small) that depend on things like service meshes, L7 routers, etc. The requirements I've been hearing are basically (1) a truly standard header name (or "names" per @SergeyKanzhelev's points), (2) some way to separate the trace_id from the span_id, and perhaps a way to make sense of standard (e.g., sampling bits) and user-specific propagated state (aka baggage). For the purposes of these use cases, actually specifying the encoding of the ids is not important.

I should mention that getting the service mesh stuff to work right end-to-end also requires some standardization of how to serialize what I typically refer to as "out-of-band" trace data, much like Zipkin issue #1499, though ideally not tied to zipkin (just as this tracecontext-spec discussion is not tied to zipkin).

Stepping back, my hope is to pair a minimalist, generic in-band-trace-context format and a minimalist, generic out-of-band-trace-data format and allow things like service meshes (etc) work with as many tracing systems as possible without code changes in those service meshes (etc). cc += @opentracing/otsc @opentracing/otiab as there has been some discussion of wading into these waters given this topic's increasing importance.

Getting back to @mtwo's vision of trace handoff between apps and PaaS/IaaS providers, I think it's appealing but also problematic in practice: my personal conversations with big-3 providers suggest there's little chance that app developers are going to be given access to traces of the internals of these managed services, many of which are proprietary.

@codefromthecrypt
Copy link

codefromthecrypt commented Jul 14, 2017 via email

@yurishkuro
Copy link
Member

yurishkuro commented Jul 14, 2017

@adriancole but what if the standard addressed both goals? Putting flags and baggage aside for a moment, what if it simply standardized on two headers (or one, with well defined value separator) to represent trace and span IDs as opaque strings, with a provision that when those IDs are fixed-width byte arrays they should be hex-encoded?

  • if both sending & receiving tracing systems use compatible format of IDs (either opaque string or fixed-width), then interop is achieved
  • if receiver works with opaque strings, then interop is also achieved
  • if sender uses strings but receiver uses fixed-width ID, interop is not possible, which is no worse than if this standard did not exist. It's actually better that nothing because at least the receiver can still log the incoming ID and support some sort of correlation lookup (according to Bogdan, the "big providers" are actually unlikely to trust external ID anyway).

@codefromthecrypt
Copy link

codefromthecrypt commented Jul 14, 2017 via email

@yurishkuro
Copy link
Member

I think we're trying to solve only one problem - interop. In the current form the proposal does not solve the problem of interop any more than already can be achieved with B3 headers, which are already a standard on their own, as you point out. I, too, prefer to close PRs quickly, but clearly there are many tracing systems that won't be able to use fixed-width byte arrays, so instead of unifying we'll be fragmenting the space even more.

Do you think what I proposed above creates a serious issue for Dapper-like systems? To me it seems very doable to support that format.

@codefromthecrypt
Copy link

@yurishkuro I do not think creating a draft and proving it works with fixed length before widening scope is fragmenting. It is simple iteration, and separates problems. Just like someone wanting Elasticsearch in jaeger can follow the issue until that's resolved or otherwise, someone interested in variable length ids can follow that issue as well. It is simply a feature on one hand, but requires work on the other. Meanwhile, jaeger has interop tests for your trace context format. It is unfair to limit good testability to just your project.

@codefromthecrypt
Copy link

@yurishkuro @SergeyKanzhelev I opened an issue on opaque ids since it seems that's the only way to get one opened. #6 Please make an effort to raise separate issues for separate features. Hijacking this issue for an unbounded scope isn't an effective way to progress work and results in circular discussion. We all have limited amounts of time and doing real work with that time is better than cluttering this PR.

@bogdandrutu
Copy link
Contributor Author

@yurishkuro for trusting because I saw multiple mentions of that, I said that the system at the edge may not trust the trace_id, but what we will do is still record them as Links for example (see https://github.com/census-instrumentation/opencensus-java/blob/master/api/src/main/java/io/opencensus/trace/Link.java), so in oder words you don't directly use the same trace_id but you record it to keep the relationship/correlation. Also this is just a possible implementation, and for sure an implementation detail.

@yurishkuro we should also keep in mind that we would like to use the same format even when we send the trace context between processes in the same organization where we want to keep the same trace-id, so understanding the format is critical there.

@adriancole I think the idea to move forward with this proposal and start having working examples between different tracing systems/libraries will prove (or not) the current work.

@SergeyKanzhelev + @bhs your main concern is that we do not specify a length for the span-id/trace-id? Are there any other major concerns? If we add that, is it enough for you to use it?

@bhs you did not specify the format that you use in LightStep, is that available publicly?

@bhs
Copy link
Contributor

bhs commented Jul 14, 2017

@bogdandrutu

@bhs you did not specify the format that you use in LightStep, is that available publicly?

Kinda? LightStep is quite adaptable and can downgrade functionality to accommodate many other trace formats, including weird in-house ones that don't have a public presence. That said, our product roadmap involves protocol changes which are strictly incompatible with the tracecontext proposal. I'm not suggesting that you need to make changes to accommodate LightStep, but since you ask, there's your answer.

@adriancole, I understand your desire to merge something. IMO the discussion is sprawling here because the file we're actually reviewing starts off with this:

Our goal is to share this with the community so that various tracing and diagnostics products can operate together, and so that services can pass context through them, even if they're not being traced (useful for load balancers, etc.)

If this is indeed designed for "various tracing and diagnostics products", it only seems natural for those tracing and diagnostic products to explain where there are problems. Bogdan brought an earlier version of this proposal up at the workshop in Berlin earlier this year, and IIRC there were a lot of semantic mismatches when we got down to brass tacks. I want to be ultra-clear that I am actually quite enthusiastic about this work, but I also want to be ultra-clear that I think it would be more successful and adopt-able if it was one notch less opinionated about things that don't actually matter right now in practice (again, like bit widths).

@bogdandrutu

@bhs your main concern is that we do not specify a length for the span-id/trace-id? Are there any other major concerns? If we add that, is it enough for you to use it?

Variable-length ids is one thing (btw, there's no need to specify a length IMO... it'd be fine in my view to just delimit fields and restrict ids to alphanumeric strings). Other problems are:

  • no key:value userdata (i.e., "baggage")
  • I would love a place to stash the sampling rate (or reciprocal sampling rate), though given baggage support we'd be there already
  • the sampling bits do not match our model, though I can overlook that / ignore those bits

If you're asking me (you're not, I know :), I would be excited about a format that has the version-ish stuff exactly as you've proposed it, alphanumeric ids for trace and span ids separated by -, then a free-form map for baggage, possibly separated between Tracer-internal state which must be propagated and user state which "should" be propagated, best-effort.

@codefromthecrypt
Copy link

codefromthecrypt commented Jul 14, 2017 via email

@bhs
Copy link
Contributor

bhs commented Jul 14, 2017

Most dapper based systems do not store IDs as alphanumeric strings, rather 64 or 128bits.

Yes, agreed, but the store-ing of the IDs and the transmission of the IDs are separate concerns. We are already talking about sending ids as hexadecimal ASCII, which is (obviously) not the in-memory representation in most tracing systems. I feel like we're arguing in circles about this... as @bogdandrutu mentioned, cloud-managed services would most likely just record the inbound ids and not even use them "natively" for tracing, so the length restriction matters even less than previously argued AFAICT.


## Field value

`base16(<version>)-<version_format>)`
Copy link

@BRMatt BRMatt Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this meant to be base16(<version>)-base16(<version_format>), or base16(<version>)-<version_format>?

@bogdandrutu
Copy link
Contributor Author

I am going to merge this pull request and address open issues in separate PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet