Roadmap to promote the opentelemetry tracing feature to be production ready #49244

@FrankChen021

We rely heavily on the distributed tracing feature in our environment to diagnose problems in distributed setups. Over a long period it has proven to be a powerful feature that has helped us find and solve many complicated problems.
Most of the work implemented in our own branch has been contributed back to the community.

The main features in the master branch now are as follows:

  1. Trace ON CLUSTER DDLs
  2. Trace a query on the local node and its sub-queries on remote nodes
  3. Trace async or sync INSERTs on distributed tables
  4. Trace queries from HTTP/TCP/gRPC
  5. Propagate the tracing context to downstream servers via the URL engine
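
For readers unfamiliar with how a trace starts on the client side (item 4), here is a minimal sketch of building a W3C Trace Context `traceparent` header value, which is the format used to hand a tracing context to a server over HTTP. The helper name `make_traceparent` is hypothetical and is not part of ClickHouse or any client library. How the header is attached to a request depends on the HTTP client in use.

```python
import secrets


def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C Trace Context `traceparent` value.

    Layout: version "00", a 16-byte trace id, an 8-byte parent (span) id,
    and a one-byte flags field, all lowercase hex, joined by "-".
    """
    # The spec treats all-zero ids as invalid, so regenerate in that
    # (astronomically unlikely) case.
    trace_id = secrets.token_hex(16)
    while int(trace_id, 16) == 0:
        trace_id = secrets.token_hex(16)

    parent_id = secrets.token_hex(8)
    while int(parent_id, 16) == 0:
        parent_id = secrets.token_hex(8)

    flags = "01" if sampled else "00"  # bit 0 is the "sampled" flag
    return f"00-{trace_id}-{parent_id}-{flags}"


# Headers a client could send along with a query over HTTP.
headers = {"traceparent": make_traceparent()}
```

The resulting value has the shape `00-<32 hex chars>-<16 hex chars>-01`.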

However, the feature is still marked as experimental.
From the community's perspective, I think it is time to plan its promotion to a production-ready feature.

Before that, here are the things I can think of that need to be completed:

  1. Standardize the attribute names as defined in the OpenTelemetry specification.
    Attribute names can currently be defined arbitrarily. It would be better to standardize some of them according to the current specification so that the logs can be easily handled by external visualization tools. This is NOT backward compatible.
  2. Investigate the root cause of Abort in OpenTelemetry::SpanHolder::finish() #49185
    Even though it occurs only in Debug builds, it indicates that the same condition could lead to incorrect logs in Release builds.
  3. Add a trace_id column to system.query_log
    This would make it clear in the query log whether a query was traced. The column could then be used to search or join the system.opentelemetry_span_log table.
  4. Support Materialized Views in distributed tracing
    See: Add OpenTelemetry Support to Materialized View #41672
  5. Propagate the tracing context to remote S3
    Some S3-compatible remote storages support distributed tracing. This would give us the ability to diagnose problems between ClickHouse and the underlying S3 storage.
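
To illustrate the kind of lookup that item 3 is meant to simplify, here is a rough sketch of going from a known `traceparent` value to a query against the span log. It assumes that `trace_id` in `system.opentelemetry_span_log` is a UUID that renders the same hex digits in the same order as the header, which should be verified on the version in use. The helper names are hypothetical.

```python
import uuid


def trace_id_to_uuid(traceparent: str) -> str:
    """Extract the trace id from a `traceparent` value and format it as a UUID string."""
    parts = traceparent.strip().split("-")
    if len(parts) != 4 or len(parts[1]) != 32:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    # uuid.UUID also validates that the field is pure hex.
    return str(uuid.UUID(hex=parts[1]))


def span_lookup_sql(traceparent: str) -> str:
    """Build a query listing the spans recorded for one trace (illustrative only)."""
    trace_id = trace_id_to_uuid(traceparent)
    return (
        "SELECT operation_name, start_time_us, finish_time_us "
        "FROM system.opentelemetry_span_log "
        f"WHERE trace_id = '{trace_id}' "
        "ORDER BY start_time_us"
    )


example = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
print(span_lookup_sql(example))
```

Today the trace id has to come from the client that sent it. With a `trace_id` column in `system.query_log`, it could be read from the query log instead.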

What do you think? @alexey-milovidov
