Roadmap to promote the OpenTelemetry tracing feature to production ready #49244

The distributed tracing feature is heavily used in our environment to diagnose problems in distributed deployments, and over a long period it has proven to be a powerful feature that has helped us find and solve many complicated problems.
Most of the work implemented in our own branch has been contributed back to the community.

The main features in the master branch now are as follows:
1. Trace `ON CLUSTER` DDLs
2. Trace a query on the local node and its sub-queries on remote nodes
3. Trace async or sync INSERTs on distributed tables
4. Trace queries from HTTP/TCP/GRPC
5. Propagate the tracing context to downstream servers via the URL engine
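
For readers less familiar with how the tracing context in items 4 and 5 travels between services, here is a minimal Python sketch of the W3C Trace Context `traceparent` header value. The function names `make_traceparent` and `parse_traceparent` are made up for illustration and are not part of ClickHouse; only the header layout (`version-trace_id-parent_id-flags`) comes from the W3C specification.

```python
import re
import secrets

# W3C Trace Context, version 00: "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>"
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def make_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value with random, non-zero trace and span ids."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 lowercase hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes  -> 16 lowercase hex chars
    # The spec treats all-zero ids as invalid; regenerate in that (very unlikely) case.
    while int(trace_id, 16) == 0:
        trace_id = secrets.token_hex(16)
    while int(span_id, 16) == 0:
        span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"  # bit 0 is the "sampled" flag
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent value into its fields, rejecting malformed input."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    if int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        raise ValueError("all-zero trace id or span id is invalid")
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

A client would send the generated value in the `traceparent` HTTP header of a query request, and a downstream server would parse it the same way to continue the trace.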

However, this feature is still marked as [experimental](https://clickhouse.com/docs/en/operations/opentelemetry).
From the community's perspective, I think it's time for us to make a plan to promote it to a production-ready feature.

Before that, here are some things I have come across that should be completed:

1. Standardize the attribute names as defined in the [OpenTelemetry specification](https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/).
    Attribute names can currently be defined in any way. It's better to follow the current specification and standardize some of them so that the logs can be easily handled by external visualization tools. **This is NOT backward compatible.**
2. Investigate the root cause of https://github.com/ClickHouse/ClickHouse/issues/49185
    Although it occurs in a Debug build, it indicates that the same condition may lead to incorrect logs in a Release build.
3. Add a `trace_id` column to `system.query_log`
    This will make it clear in the query log whether a query is traced or not. The column can then be used to search or join the `system.opentelemetry_span_log` table.
4. Support Materialized Views in distributed tracing
    See: https://github.com/ClickHouse/ClickHouse/pull/41672
5. Propagate the tracing context to remote S3
    Some S3-compatible remote storages support distributed tracing. This would give us the ability to diagnose problems between ClickHouse and the underlying S3 storage.
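
To make item 1 and its compatibility impact more concrete, here is a rough Python sketch of a rename pass over span attributes. The legacy names on the left of `RENAMES` are hypothetical placeholders, not the names ClickHouse actually emits today; the names on the right (`db.statement`, `db.user`, `db.name`) are taken from the database semantic conventions linked above as they stood when this issue was written, and should be checked against the current specification.

```python
# Hypothetical legacy attribute names -> names from the OpenTelemetry
# database semantic conventions. The left-hand keys are illustrative only.
RENAMES = {
    "query": "db.statement",
    "user": "db.user",
    "database": "db.name",
}


def standardize_attributes(attrs: dict, renames: dict = RENAMES) -> dict:
    """Return a copy of attrs with legacy keys renamed to standard ones.

    Keys that are not listed in `renames` are kept unchanged. If a span
    already carries the standard key, that value wins over the legacy one.
    """
    # First keep everything that does not need renaming.
    result = {k: v for k, v in attrs.items() if k not in renames}
    # Then add renamed keys without overwriting an existing standard key.
    for old_key, new_key in renames.items():
        if old_key in attrs:
            result.setdefault(new_key, attrs[old_key])
    return result
```

Any dashboard or query that still filters on a legacy key would stop matching after such a rename, which is why this change is not backward compatible.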

What do you think? @alexey-milovidov 


