Tracing System Overview #3

Hujun commented May 9, 2018

Nowadays a tracing system has become a must-have piece of infrastructure alongside the long-established logging system. That is not surprising, since the micro-service architecture has been widely adopted during the last decade (even though many developers and so-called architects do not know exactly how to implement it). This article aims to give a clear outline of tracing systems in terms of features, architectures, and pros and cons, by analyzing the most famous tracing systems: Dapper, Zipkin, Jaeger and OpenTracing.

Google Dapper

When we talk about tracing systems in the context of a micro-service architecture, it is impossible to ignore the famous Google Dapper paper. It is so important not only because Dapper was the first practical large-scale distributed tracing system running in a complex production environment, but also because of the key features, definitions and best practices it unveiled.

Terms

You have probably heard of or seen the word "span" when trying to use a tracing system client, but you may have been confused about why there is a "span" at all. It comes from the frequently quoted diagram below:

(Figure: Dapper trace tree of spans)

It is an elegant data structure for tracing data, which is why all the followers have kept the design and the terms untouched.
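To make the structure concrete, here is a minimal sketch (the class and field names are illustrative, not Dapper's actual API): each span records its own id, its parent's id, and the trace id shared by the whole request, and the parent pointers are what turn a flat list of spans into the trace tree.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed unit of work; parent_id links spans into the trace tree."""
    operation: str
    trace_id: str                      # shared by every span of the same request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_id: Optional[str] = None    # None for the root span
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    annotations: Dict[str, str] = field(default_factory=dict)

    def finish(self) -> None:
        self.end = time.time()


def build_trace_tree(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Group spans by parent_id so a trace can be rendered as a tree."""
    children: Dict[Optional[str], List[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    return children
```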

Another important term introduced is "annotation", which provides a mechanism for extending spans with application-specific information. As chapter 3.3 of the paper puts it, "Programmers tend to use application-specific annotations either as a kind of distributed debug log file or to classify traces by some application-specific features". Other tracing systems have similar designs for the same purpose. In fact, all of the Google experiences and use cases mentioned in chapter 6 of the paper rely more or less on information attached via annotations. On the other hand, the authors do not forget to stress the "small overhead" principle. The reminder is necessary because some people still use a tracing system to do the work of logging and statistics tools.
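Building on the illustrative Span sketch above, an annotation is just application-specific key/value data hung off a span; keeping it to a few small values is what preserves the "small overhead" principle.

```python
import uuid

# Reusing the illustrative Span class sketched above.
root = Span(operation="GET /checkout", trace_id=uuid.uuid4().hex)

# Application-specific annotations: a few cheap key/value pairs,
# not a full debug-log dump (the "small overhead" principle).
root.annotations["user.tier"] = "premium"
root.annotations["cart.items"] = "3"
root.finish()
```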

Tracing System Implementation Principles

Besides the definitions of tracing data structures, the other contribution of Dapper is to explain how to implement a tracing system (even if not in detail) and why.

Let's look at another thousand-times-quoted diagram below:

(Figure: Dapper trace collection pipeline)

The pipeline can be summarized in a few key points (a minimal sketch follows the list):

  • Trace data is sent to a dedicated local daemon. Services do not connect directly to the collector, which keeps latency and blocking errors off the request path.
  • All daemons send data to a central collector.
  • The data sent by daemons should be sampled in order to keep the collected volume as small as possible, while the sampled fraction must not degrade tracing quality.
  • The central collector saves tracing data in a search-oriented NoSQL database with appropriate indexes.
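To make the reporting side concrete, here is a minimal illustrative sketch; the port, sample rate and JSON-over-UDP wire format are assumptions, not Dapper's or Zipkin's actual protocol. The sampling decision is made once at the root of the trace, and spans are fired at a local daemon over UDP so the request path never blocks on the collector.

```python
import json
import random
import socket

SAMPLE_RATE = 0.001                  # assumed rate; the paper reports rates as low as 1/1024
DAEMON_ADDR = ("127.0.0.1", 9999)    # hypothetical local daemon port

_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def start_trace() -> bool:
    """The sampling decision is made once, at the root, and propagated downstream."""
    return random.random() < SAMPLE_RATE


def report_span(span: dict, sampled: bool) -> None:
    """Fire-and-forget: UDP to the local daemon, never a direct call to the collector."""
    if not sampled:
        return
    _sock.sendto(json.dumps(span).encode("utf-8"), DAEMON_ADDR)
```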

All of these principles have been inherited wholesale by the followers.

The Dapper paper gives other important notes which are not only about Google-internal usage but are universal for all cases.

For example, ubiquitous deployment. At Google the Dapper client is embedded in the internal RPC framework, so the trace context is injected automatically, and the Dapper daemon is automatically deployed in the service container through the common Google-internal base image for services. Although we cannot use these Google-internal infrastructures (in fact we do not need to...), it is still a very helpful best practice.
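The point of ubiquitous deployment is that application code never touches the trace context itself; the RPC or HTTP framework injects it on every outbound call. A hypothetical sketch of that idea (the header names and the use of the requests library are illustrative, not Google's internal mechanism):

```python
import requests

# Hypothetical header names; real systems use e.g. B3 or W3C traceparent headers.
TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Parent-Span-Id"


class TracedClient:
    """Wraps an HTTP client so every outbound call carries the trace context."""

    def __init__(self, trace_id: str, current_span_id: str):
        self.trace_id = trace_id
        self.current_span_id = current_span_id

    def get(self, url: str, **kwargs) -> requests.Response:
        headers = kwargs.pop("headers", {})
        headers[TRACE_HEADER] = self.trace_id
        headers[SPAN_HEADER] = self.current_span_id
        return requests.get(url, headers=headers, **kwargs)
```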

There are other significant suggestions, such as a secondary round of sampling inside the collector when trace volume grows too large, and the use of a column-oriented database like HBase (the open-source implementation of BigTable) to store tracing data.
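The collector-side sampling idea is easy to sketch: hash the trace id to a value in [0, 1) and keep the trace only if it falls below a global collection coefficient, so that every span of a trace gets the same keep-or-drop decision. The hash function and coefficient below are illustrative choices.

```python
import hashlib

COLLECTION_COEFFICIENT = 0.1   # assumed value; tunable without touching services


def keep_trace(trace_id: str) -> bool:
    """Deterministic per-trace decision: all spans of one trace share the same verdict."""
    digest = hashlib.md5(trace_id.encode("utf-8")).digest()
    scalar = int.from_bytes(digest[:8], "big") / 2 ** 64   # map hash to [0, 1)
    return scalar < COLLECTION_COEFFICIENT
```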

Zipkin

Zipkin is the first well-known open-source implementation of a Dapper-like system. The relation between Zipkin and Dapper is much like that between HBase and BigTable.

Zipkin focuses on the collector side, with BigTable-like storage (Zipkin uses Cassandra or Elasticsearch), a query API similar to Dapper's DAPI, and a GUI. I believe most users chose it because of its out-of-the-box web UI; it was really impressive at the time (maybe not as good as Dapper's, but impressive enough for those not working at Google), even though Zipkin is missing some features such as trace data sampling and control of the reporting daemon.

It is a straightforward implementation built from mature open-source Java components. I suspect many similar implementations were built silently; Zipkin was simply the first, and probably one of the best. Some talented and ambitious teams were Zipkin users. Among them was the Uber devops team, which later built its own tracing system, and I will say a few words about it in the following sections.

Zipkin has since been renamed OpenZipkin, with the aim of attracting more contributions from the community. But compared with opentracing/jaegertracing, which are already part of the Cloud Native Computing Foundation, Zipkin no longer looks like the obvious first choice for new projects implementing tracing.

OpenTracing

OpenTracing is a project for a universal tracing specification rather than a ready-for-production implementation. In the OpenTracing GitHub repos, you can find:

  • Semantic Specification: definition of the data models (trace, span, context) and the APIs. It gives implementation details of the tracing data models that are missing from the original Google Dapper paper.
  • Semantic Conventions: standard tags for spans, much like the key/value usage of annotations mentioned in the Google Dapper paper.
  • Client libs for different languages: in fact OpenTracing does not provide production-ready tracing libraries. It only defines APIs that follow the OpenTracing specification; a concrete instrumentation needs to inherit from the OpenTracing client lib as a base class and override its methods (just as the Uber Jaeger client does). A short usage sketch follows the list.
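To get a feel for the API part, here is a short example against the opentracing-python reference library (2.x API). The default tracer is a no-op; a concrete implementation such as the Jaeger client would normally be installed as the global tracer.

```python
import opentracing
from opentracing.propagation import Format

# The spec ships a no-op tracer by default; a concrete implementation
# (e.g. jaeger_client) would be registered as the global tracer instead.
tracer = opentracing.global_tracer()

with tracer.start_active_span("checkout") as scope:
    scope.span.set_tag("http.method", "POST")        # semantic-convention tag
    scope.span.log_kv({"event": "cart_validated"})   # structured log on the span

    # Serialize the span context into carrier headers for an outbound call.
    headers = {}
    tracer.inject(scope.span.context, Format.HTTP_HEADERS, headers)
```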

For a clearer overview of the data model, have a look at the following diagram:

(Figure: OpenTracing data model example)

Remember that OpenTracing is just a specification. Use it as a guideline, not as an out-of-the-box tracing product. It is a good starting point for developing your own tracing system or for understanding how a tracing system works. Note that OpenTracing does not cover any specification or implementation details for the tracing daemon or the centralized collector.

The specific tracing clients used in frameworks, libraries and projects are called instrumentations. Since every framework implements its request context differently, there is no way to have one generalized instrumentation for all frameworks. OpenTracing gives a clear guide for developing an instrumentation on top of it:

The work of instrumentation libraries generally consists of three steps (a code sketch follows the list):

  1. When a service receives a new request (over HTTP or some other protocol), it uses OpenTracing's inject/extract API to continue an active trace, creating a Span object in the process. If the request does not contain an active trace, the service starts a new trace and a new root Span.
  2. The service needs to store the current Span in some request-local storage, where it can be retrieved from when a child Span must be created, e.g. in case of the service making an RPC to another service.
  3. When making outbound calls to another service, the current Span must be retrieved from request-local storage, a child span must be created (e.g., by using the start_child_span() helper), and that child span must be embedded into the outbound request (e.g., using HTTP headers) via OpenTracing's inject/extract API.
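Putting the three steps together, a minimal sketch of a framework instrumentation using the opentracing-python API might look like the following; the request-local storage and the header dictionaries are placeholders for whatever the framework actually provides.

```python
import opentracing
from opentracing.propagation import Format

tracer = opentracing.global_tracer()
_request_local = {}   # placeholder for real request-local storage (thread/context local)


def on_incoming_request(headers: dict, operation: str):
    """Step 1: continue the caller's trace if present, otherwise start a new root span."""
    try:
        parent_ctx = tracer.extract(Format.HTTP_HEADERS, headers)
    except (opentracing.InvalidCarrierException,
            opentracing.SpanContextCorruptedException):
        parent_ctx = None
    span = tracer.start_span(operation, child_of=parent_ctx)
    _request_local["span"] = span    # step 2: stash it for later retrieval
    return span


def on_outgoing_call(operation: str, outbound_headers: dict):
    """Step 3: create a child span and inject its context into the outbound request."""
    parent = _request_local.get("span")
    child = tracer.start_span(operation, child_of=parent)
    tracer.inject(child.context, Format.HTTP_HEADERS, outbound_headers)
    return child
```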

For direct use of tracing clients (instrumentations), you can find mainstream ones in the OpenTracing API Contributions. If you use Python, you may find some useful library instrumentations in Uber's OpenTracing Python Instrumentation project.
