[WIP] Library instrumentation guidance review #917

lmolkova · 2018-05-19T00:55:37Z

Fix Issue #913.

Updating library instrumentation guidance per feedback.

This is a work in progress and there are many TODO/TBD items in the doc.
Consumer guide will be started separately.

vancem · 2018-05-22T22:08:42Z

LibraryInstrumentationGuidance.md


 ```csharp
+    // create DiagnosticListener with unique name
    static DiagnosticListener source = new DiagnosticListener("Example.MyLibrary");


I would change Example.MyLibrary, to something like 'NameSpace.ComponentName' to give people guidance on what a good name looks like

This static variable should be explained in a separate code block IMO to better differentiate it from the actual instrumentation code.

vancem · 2018-05-22T22:13:44Z

LibraryInstrumentationGuidance.md

-        activity = new Activity("Example.MyLibrary.MyOperation");
-        source.StartActivity(activity, new { Input = input });
+        activity = new Activity("MyOperation");
+        activity.Start();


You can simplify to

activity = new Activity("MyOperation").Start();

But we probably want to make a helper 'DiagnosticListener.StartIfEnabled(string)' which does the lines above (and inlines), and thus the guidance is to do

Activity activity = source.StartIfEnabled("MyOperation")

See other comment. Firing the "Start"-event should be a MUST. It's a complete mess if this is optional and only done by a few libraries.

Having "activity.Start()" and "source.StartActivity()" is very confusing anyway and I too think that this should be simplified somehow in the API.

vancem · 2018-05-22T22:15:04Z

LibraryInstrumentationGuidance.md

+    }
+    catch (Exception ex)
+    {
+        activity?.AddTag("error", true);


Tags values are strings so this has to be

activity?.AddTag("error", "true");

We should probably add special helpers like 'AddErrorTag' for canonical tags as we find they are very generic. That makes the code simpler and allows Activity to encode it more efficiently in the future.

having 'status' on the Activity would be great once we agree on the contract!

TL;DR;
So, I would recommend setting a status code and optionally sending exception event.

Status should still provide "OK" or "something else" (like OpenCensus does). Having strict enum-kind of mapping could be an option too, but it seems hard to do this mapping for each exception in the library

Since the result of operation is special and essential, I think we should have a Status property (and SetStatus(string) method) on the Activity.

I put "error" as a placeholder and do not think we should use it. Boolean value as OpenTracing suggests here does not seem to be informative enough: HTTP, gRPC and perhaps other communication would rather provide a status code. OpenCensus suggests status code + details.

While it is useful to stick to the tags suggested by OpenTracing, I don't really understand the guidance there as well.

@cwe1ss, could you please advise us, what is the OpenTracing recommendation for errors

I see suggestion to use

event="error" error.object=<error object instance>

where possible. What about generic (non http) status recommendation in the tags?

This recommendation translates to DiagnosticSource 'Exception' event rather than a tag on the Span/Activity.
After working with exception events for a while, I can see an issue with them:
We expected someone would be interested in exceptions only without Activities. It seems tracking ALL exceptions without context is not that useful - it could mean overwhelming amount of handled exceptions and we hear users are interested mostly in unhandled ones.
Because of the same assumptions they complicate instrumentation a bit.

Status code

OpenTracing currently only defines the "http.status_code" header and no non-http status codes. As there's no real protocol-independent standard (or none that I know of), I doubt that it makes sense to have a generic "status" tag/property. Instead, I think that libraries that have status codes should add their own tags - e.g. "grpc.status_code".

Errors

As far as I understand, the OpenTracing "error" tag is for "Exception"-like cases, yes. Most OpenTracing instrumentation I've seen adds the "Exception" details via a span.Log( "event"="error", "error.object"=exception ) call (pseudo code).
Of course, the question is, whether this a case of the "log & throw" anti-pattern. If there's a parent span/activity it might log the same exception and re-throw again.

Not sure what the best solution is here. The instrumentation code could only set the "error" tag and not log/raise the exception. This means it would rely on some parent span/activity do log it but what if there isn't one? Also, it might make debugging harder as the exception details are associated with an activity/span that did not raise it. Imagine there's a retry-scenario where something fails twice and succeeds on the third attempt. Who's responsible for logging the exceptions?

There's also a difference between OpenTracing and Activity because with OpenTracing's "Log" concept, the exception always stays with the span whereas this is more decoupled in Activity where it currently can only be done via a separate DiagnosticSource event.

vancem · 2018-05-22T22:26:13Z

LibraryInstrumentationGuidance.md

    }
    finally
    {
        // stop activity if started
        if (activity != null)
-             source.StopActivity(activity, new { Input = input, Output = output }); 
+             source.StopActivity(activity, new MyStopPayload { Input = input, Output = output }); 


We should have a discussion about whether we should guide people to use explicit payload classes. They can improve efficiency and ease of use on the subscriber side, HOWEVER, they also are a mildly painful to use and do couple the receiver to the exact version of the sender. We do have 'fast' reflection APIs (see PropertyFetch class in DIagnosticSourceEventSource) that we can publically expose. These are 'pretty fast (< 10 instructions to fetch), and are probably good enough.

If that is not sufficient, and we really want to recommend everyone making their own types, you should include the definition of MyStopPayload in the example and we need give guidance on naming (the type should live in the YOUR_COMPONENT.DiagnosticSource namespace) and we need to describe versioning, and the fact that using Properties (not fields) is better.

So we had a bit of discussion about it with @NickCraver in StackExchange/StackExchange.Redis#833 and https://github.com/dotnet/corefx/issues/18114.
I believe his concerns are 1) hard to use 2) no contract and compile time checks.
I agree with the first one, but not the second: we have to keep events and anonymous payloads backward compatible anyway and in reality for end-user application there is no compile time check.

It seems we do recommend using explicit types in DiagnosticListnener guide already.

Property fetching: we have Microsoft.Extensions.DiagnosticAdapter and while it does efficient payload parsing, it does not play well with Activities/sampling scenarios and deprecated events.
So we should either invest in DiagnosticAdapter or deprecate it and expose Property Fetcher.

Anyway, I agree we need more info about naming/versioning. I will add guidance from DiagnosticListnener guide. Eventually we can convert this doc +diagnostic listener + activity guide to several articles on docs.ms.com

It would be great if "PropertyFetcher" became a first class citizen. I tried the DiagnosticAdapter project but I removed it again because I thought it was too complicated and of limited use. Using a switch on the event names and the PropertyFetcher stuff isn't much less readable.

vancem · 2018-05-22T22:34:53Z

LibraryInstrumentationGuidance.md

+
+### Context Propagation
+
+In distributed systems, one user operation may be processed by many applications. To trace operation end-to-end, some context must be propagated along with external call.


This needs more meat. Basically an example (or leave it out, because as it is it does not help anyone).

You know my concerns about having hard-coded header names etc. in instrumentation code. We should have a proper discussion about this at some point. Until then, "context propagation" should not be documented IMO.

vancem · 2018-05-22T22:37:43Z

Thanks @lmolkova for driving this. I agree with the general community that we need more investment in getting good instrumentation

Are we saying that we have some examples of this now in Azure code? If it is open source, we should put the urls here as examples are useful (and I would be interested in seeing them).

cwe1ss · 2018-05-25T15:42:07Z

LibraryInstrumentationGuidance.md


 ```csharp
+    // create DiagnosticListener with unique name
    static DiagnosticListener source = new DiagnosticListener("Example.MyLibrary");


This static variable should be explained in a separate code block IMO to better differentiate it from the actual instrumentation code.

cwe1ss · 2018-05-25T15:45:04Z

LibraryInstrumentationGuidance.md

+**CONSIDER** providing more context via 'IsEnabled(OperationName, Input)' overload instaed of 'IsEnabled(OperationName)'
+In some cases tracing system needs more context to filter operations. A good example is tracing system that tracks HTTP calls and sends telemetry over HTTP as well. To avoid recursive tracing and save performance tracing system want to prevent such calls to be instrumented at all based on some request properties.
+
+**CONSIDER** firing 'MyOperation.Start' event


Firing the "Start" event should be a MUST in my opinion. Having only some libraries that do this is a complete mess. It also completely breaks adapters that must do stuff on every activity start event.

cwe1ss · 2018-05-25T15:46:00Z

LibraryInstrumentationGuidance.md

-        activity = new Activity("Example.MyLibrary.MyOperation");
-        source.StartActivity(activity, new { Input = input });
+        activity = new Activity("MyOperation");
+        activity.Start();


See other comment. Firing the "Start"-event should be a MUST. It's a complete mess if this is optional and only done by a few libraries.

cwe1ss · 2018-05-25T16:05:02Z

LibraryInstrumentationGuidance.md

-        activity = new Activity("Example.MyLibrary.MyOperation");
-        source.StartActivity(activity, new { Input = input });
+        activity = new Activity("MyOperation");
+        activity.Start();


Having "activity.Start()" and "source.StartActivity()" is very confusing anyway and I too think that this should be simplified somehow in the API.

cwe1ss · 2018-05-25T16:32:18Z

LibraryInstrumentationGuidance.md

+    }
+    catch (Exception ex)
+    {
+        activity?.AddTag("error", true);


Status code

OpenTracing currently only defines the "http.status_code" header and no non-http status codes. As there's no real protocol-independent standard (or none that I know of), I doubt that it makes sense to have a generic "status" tag/property. Instead, I think that libraries that have status codes should add their own tags - e.g. "grpc.status_code".

Errors

As far as I understand, the OpenTracing "error" tag is for "Exception"-like cases, yes. Most OpenTracing instrumentation I've seen adds the "Exception" details via a span.Log( "event"="error", "error.object"=exception ) call (pseudo code).
Of course, the question is, whether this a case of the "log & throw" anti-pattern. If there's a parent span/activity it might log the same exception and re-throw again.

Not sure what the best solution is here. The instrumentation code could only set the "error" tag and not log/raise the exception. This means it would rely on some parent span/activity do log it but what if there isn't one? Also, it might make debugging harder as the exception details are associated with an activity/span that did not raise it. Imagine there's a retry-scenario where something fails twice and succeeds on the third attempt. Who's responsible for logging the exceptions?

There's also a difference between OpenTracing and Activity because with OpenTracing's "Log" concept, the exception always stays with the span whereas this is more decoupled in Activity where it currently can only be done via a separate DiagnosticSource event.

cwe1ss · 2018-05-25T16:45:14Z

LibraryInstrumentationGuidance.md

    }
    finally
    {
        // stop activity if started
        if (activity != null)
-             source.StopActivity(activity, new { Input = input, Output = output }); 
+             source.StopActivity(activity, new MyStopPayload { Input = input, Output = output }); 


It would be great if "PropertyFetcher" became a first class citizen. I tried the DiagnosticAdapter project but I removed it again because I thought it was too complicated and of limited use. Using a switch on the event names and the PropertyFetcher stuff isn't much less readable.

cwe1ss · 2018-05-25T16:47:05Z

LibraryInstrumentationGuidance.md

+
+**DO** make `DiagnosticListener` name globally unique.
+Every DiagnosticListener has a name. Tracing system discovers all available sources and subscribes to them based on the name. 
+[DiagnosticSource guide](https://github.com/dotnet/corefx/blob/master/src/System.Diagnostics.DiagnosticSource/src/DiagnosticSourceUsersGuide.md#naming-conventions) provides more naming considerations.


As I've said in #913, this document should be merged with the existing UserGuides.

cwe1ss · 2018-05-25T16:53:00Z

LibraryInstrumentationGuidance.md

+
+### Context Propagation
+
+In distributed systems, one user operation may be processed by many applications. To trace operation end-to-end, some context must be propagated along with external call.


You know my concerns about having hard-coded header names etc. in instrumentation code. We should have a proper discussion about this at some point. Until then, "context propagation" should not be documented IMO.

…exception event added, to be continued

cwe1ss · 2018-06-04T19:41:30Z

@lmolkova thank you for incorporating my feedback!

ReubenBond · 2018-09-17T11:26:51Z

LibraryInstrumentationGuidance.md

+**DO** Create one `DiagnosticListener` instance for a particular source (make it static): each time when new DiagnoscticListener is created, all subscribers are notified and about it.
+
+**DO** call 'IsEnabled()' for every operation.
+The first `source.IsEnabled()` call efficiently tell if there is any listener in the system for our source. 


s/tell/tells maybe

ReubenBond · 2018-09-17T11:27:13Z

LibraryInstrumentationGuidance.md

+
+**DO NOT** cache results of IsEnabled: tracing system may subscribe/unsubscribe at any moment. Results of IsEnabled are dynamic and depend on the context.
+
+**CONSIDER** providing more context via 'IsEnabled(OperationName, Input)' overload instaed of 'IsEnabled(OperationName)'


vancem

I am OK with this guidance. There is the question of how people find it, but that should not block this checkin

lmolkova · 2018-09-18T02:52:15Z

I have not provided an update in a while...
We've been working on investigating and evaluating partnership with OpenCensus: https://open.microsoft.com/2018/06/13/microsoft-joins-the-opencensus-project/
vendor-agnostic polyglot libs and tools for distributed tracing.

We are looking into providing common guidance for lib instrumentation later this year. We still recommend instrumenting .NET libraries with DiagnosticSource and Activity but want to have a consistent story in a polyglot environment, and we are still finalizing details.

@ReubenBond thanks for the review. I'll fix the typos and merge it.

@vancem thanks! there are a lot of TBDs in the guidance, so it's not really ready for docs.microsoft.com. I'll work on this once we finalize OpenCensus story.

TimothyMothra · 2019-08-01T18:47:40Z

@lmolkova should we close this?

Library instrumentation guidance review

cd4b50a

vancem reviewed May 22, 2018

View reviewed changes

cwe1ss reviewed May 25, 2018

View reviewed changes

some feedback: StartOperation required, Context propagation removed, …

9b8fd2e

…exception event added, to be continued

TimothyMothra added the documentation label Jun 4, 2018

up

4899086

cwe1ss mentioned this pull request Jul 2, 2018

Actor tracing akkadotnet/akka.net#2027

Closed

ReubenBond reviewed Sep 17, 2018

View reviewed changes

NickDarvey mentioned this pull request Sep 17, 2018

Distributed tracing and Orleans dotnet/orleans#4992

Closed

vancem approved these changes Sep 17, 2018

View reviewed changes

Liudmila Molkova added 3 commits September 17, 2018 19:54

fix typos

4e27fb5

Merge branch 'develop' into lmolkova/guidance

e3b689e

Merge branch 'develop' into lmolkova/guidance

b3a1d3a

TimothyMothra closed this Feb 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Library instrumentation guidance review #917

[WIP] Library instrumentation guidance review #917

lmolkova commented May 19, 2018 •

edited

vancem May 22, 2018

cwe1ss May 25, 2018

vancem May 22, 2018

cwe1ss May 25, 2018

cwe1ss May 25, 2018

vancem May 22, 2018

vancem May 22, 2018

lmolkova May 23, 2018

cwe1ss May 25, 2018

vancem May 22, 2018

lmolkova May 23, 2018

cwe1ss May 25, 2018

vancem May 22, 2018

cwe1ss May 25, 2018

vancem commented May 22, 2018

cwe1ss May 25, 2018

cwe1ss May 25, 2018

cwe1ss May 25, 2018

cwe1ss May 25, 2018

cwe1ss May 25, 2018

cwe1ss May 25, 2018

cwe1ss May 25, 2018

cwe1ss May 25, 2018

cwe1ss commented Jun 4, 2018

ReubenBond Sep 17, 2018

ReubenBond Sep 17, 2018

vancem left a comment

lmolkova commented Sep 18, 2018

TimothyMothra commented Aug 1, 2019


		### Context Propagation

		In distributed systems, one user operation may be processed by many applications. To trace operation end-to-end, some context must be propagated along with external call.


		DO NOT cache results of IsEnabled: tracing system may subscribe/unsubscribe at any moment. Results of IsEnabled are dynamic and depend on the context.

		CONSIDER providing more context via 'IsEnabled(OperationName, Input)' overload instaed of 'IsEnabled(OperationName)'

[WIP] Library instrumentation guidance review #917

[WIP] Library instrumentation guidance review #917

Conversation

lmolkova commented May 19, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Status code

Errors

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vancem commented May 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Status code

Errors

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cwe1ss commented Jun 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vancem left a comment

Choose a reason for hiding this comment

lmolkova commented Sep 18, 2018

TimothyMothra commented Aug 1, 2019

lmolkova commented May 19, 2018 •

edited