
Enhance ClickHouse Profile: generate a uniq id for steps and processors #63518

Open
wants to merge 1 commit into master
Conversation


@qhsong qhsong commented May 8, 2024

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Generate a unique ID for each query plan step and processor so that profiling output can be matched to the query plan.

ClickHouse's current profiling output is sometimes confusing:

  • For explain plan we get step names.
  • For explain pipeline we get processor names.
  • For system.processors_profile_log / system.opentelemetry_span_log, we get a pointer address.

When analyzing a complex query with duplicate names, it is hard to match these entries to one another.
I think we should generate a unique ID for every processor and step, one that is meaningful and does not change between runs of the same query. I use the ${NAME}_${INDEX} pattern as the ID format: ${NAME} is the step/processor name, and ${INDEX} is assigned in generation order.

After this PR, for the query select * from t1 as t join t1 as t2 on t.a=t2.a where t.a=1:

  • explain
  ┌─explain───────────────────────────────────────────┐
1. │ Expression_20 ((Project names + (Projection + ))) │
2. │   Join_6 (JOIN FillRightFirst)                    │
3. │     Expression_21                                 │
4. │       ReadFromMergeTree_0 (default.t1)            │
5. │     Expression_22                                 │
6. │       ReadFromMergeTree_3 (default.t1)            │
   └───────────────────────────────────────────────────┘
  • explain pipeline
    ┌─explain──────────────────────────────────────────────────────────────────┐
 1. │ (Expression_20)                                                          │
 2. │ ExpressionTransform                                                      │
 3. │   (Join_6)                                                               │
 4. │   JoiningTransform 2 → 1                                                 │
 5. │     (Expression_21)                                                      │
 6. │     ExpressionTransform                                                  │
 7. │       (ReadFromMergeTree_0)                                              │
 8. │       MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder) 0 → 1   │
 9. │     (Expression_22)                                                      │
10. │     FillingRightJoinSide                                                 │
11. │       ExpressionTransform                                                │
12. │         (ReadFromMergeTree_3)                                            │
13. │         MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder) 0 → 1 │
    └──────────────────────────────────────────────────────────────────────────┘
  • select id, name, parent_ids, plan_step from system.processors_profile_log; (see the cross-referencing sketch after this list)
     ┌─id──────────────────────────┬─name────────────────────┬─parent_ids──────────────────────┬─plan_step──────────┐
  6. │ SourceFromSingleChunk_1     │ SourceFromSingleChunk   │ ['ExpressionTransform_2']       │                    │
  7. │ ExpressionTransform_2       │ ExpressionTransform     │ ['LimitsCheckingTransform_3']   │ Expression_19      │
  8. │ LimitsCheckingTransform_3   │ LimitsCheckingTransform │ ['LazyOutputFormat_4']          │                    │
  9. │ NullSource_5                │ NullSource              │ ['LazyOutputFormat_4']          │                    │
 10. │ NullSource_6                │ NullSource              │ ['LazyOutputFormat_4']          │                    │
 11. │ LazyOutputFormat_4          │ LazyOutputFormat        │ []                              │                    │
 12. │ SourceFromSingleChunk_18    │ SourceFromSingleChunk   │ ['FilterTransform_19']          │                    │
 13. │ FilterTransform_19          │ FilterTransform         │ ['ExpressionTransform_20']      │ Filter_416         │
 14. │ ExpressionTransform_20      │ ExpressionTransform     │ ['DistinctTransform_83']        │ Expression_167     │
 15. │ SourceFromSingleChunk_21    │ SourceFromSingleChunk   │ ['FilterTransform_22']          │                    │
 16. │ FilterTransform_22          │ FilterTransform         │ ['ExpressionTransform_23']      │ Filter_417         │
 17. │ ExpressionTransform_23      │ ExpressionTransform     │ ['DistinctTransform_84']        │ Expression_433     │
  • select operation_name from system.opentelemetry_span_log;
    ┌─operation_name───────────────────────────────────────────────┐
 1. │ DB::InterpreterSelectQueryAnalyzer::execute()                │
 2. │ ThreadPoolRead                                               │
 3. │ ThreadPoolRead                                               │
 4. │ MergeTreeSource::tryGenerate()                               │
 5. │ MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder)_0 │
 6. │ ExpressionTransform_1                                        │
 7. │ ExpressionTransform_2                                        │
 8. │ LimitsCheckingTransform_3                                    │
 9. │ LazyOutputFormat_4                                           │
10. │ MergeTreeSource::tryGenerate()                               │
11. │ MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder)_0 │
12. │ NullSource_5                                                 │
13. │ NullSource_6                                                 │
14. │ LazyOutputFormat_4                                           │
15. │ PipelineExecutor::execute()                                  │
16. │ QueryPullPipeEx                                              │
17. │ query                                                        │
18. │ TCPHandler                                                   │
    └──────────────────────────────────────────────────────────────┘
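
With these IDs the outputs above can be cross-referenced directly. A minimal sketch of how that could look, assuming the id/plan_step columns shown above plus the pre-existing elapsed_us and query_id columns of processors_profile_log (the query_id value is a placeholder):

    -- total processor time per plan step for one query
    SELECT
        plan_step,
        sum(elapsed_us) AS total_elapsed_us,
        groupArray(id)  AS processor_ids   -- e.g. ['ExpressionTransform_2', ...]
    FROM system.processors_profile_log
    WHERE query_id = '<query_id>'
    GROUP BY plan_step
    ORDER BY total_elapsed_us DESC;

The plan_step values (e.g. Expression_19) match the step names printed by explain, so a slow processor can be traced back to its step in the plan.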

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

Modify your CI run

NOTE: If you merge the PR with modified CI, you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • All with Azure
  • Add your option here

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • Add your option here

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

@CLAassistant

CLAassistant commented May 8, 2024

CLA assistant check
All committers have signed the CLA.

@nickitat nickitat self-assigned this May 8, 2024
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-improvement Pull request with some product improvements label May 8, 2024
@robot-ch-test-poll1
Contributor

robot-ch-test-poll1 commented May 8, 2024

This is an automated comment for commit ee5d22c with a description of existing statuses. It's updated for the latest CI run.

❌ Click here to open a full report in a separate page

Check name | Description | Status
CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ⏳ pending
ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ❌ failure
Mergeable Check | Checks if all other necessary checks are successful | ❌ failure
Successful checks
Check name | Description | Status
A Sync | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
Docs check | Builds and tests the documentation | ✅ success
Fast test | Normally this is the first check run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success
PR Check | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some of the tests fail, see the related log from the report | ✅ success

@nickitat nickitat added the can be tested Allows running workflows for external contributors label May 8, 2024
@qhsong
Author

qhsong commented May 8, 2024

I'm not sure whether this idea works for ClickHouse; if it does, I will add more test cases for it.

@nickitat
Member

nickitat commented May 8, 2024

For explain plan and pipeline I don't think there is a lot of confusion, since they already contain formatting that displays the hierarchy. Also, we have a lot of tests that check the plan or pipeline specifically; they will all break.
Speaking of processors_profile_log — fully agree.
Maybe let's implement it only for processors_profile_log and opentelemetry_span_log?

@qhsong
Author

qhsong commented May 9, 2024

For explain plan and pipeline I don't think there is a lot of confusion, since they already contain formatting that displays the hierarchy. Also, we have a lot of tests that check the plan or pipeline specifically; they will all break. Speaking of processors_profile_log — fully agree. Maybe let's implement it only for processors_profile_log and opentelemetry_span_log?

In fact, I believe the explain plan plays a crucial role in this PR. When we use processors_profile_log or opentelemetry_span_log to identify a specific processor or step with slow execution, how else can we determine the corresponding details for that step? Therefore, I think it's essential.

I have observed that it breaks some stateless test cases. I believe it's worthwhile to update the test case content; it's not hard to change.

@UnamedRus
Contributor

for explain plan and pipeline

I actually think it makes more sense to add them (the IDs) to the JSON output of those statements, which is a more suitable format for consumption by programs, and introducing a new field is simpler there.

@qhsong
Author

qhsong commented May 10, 2024

for explain plan and pipeline

I actually think it makes more sense to add them (the IDs) to the JSON output of those statements, which is a more suitable format for consumption by programs, and introducing a new field is simpler there.

I will also add a field to the explain json output. I will fix it later.

@qhsong
Author

qhsong commented May 15, 2024

Summary of this feature:

  • Add "Node Id" in explain json=1
            {
              "Node Type": "Expression",
              "Node Id": "Expression_22",
              "Plans": [
                {
                  "Node Type": "ReadFromMergeTree",
                  "Node Id": "ReadFromMergeTree_2",
                  "Description": "default.t1"
                }
              ]
            }
  • Add step id and processor id in explain pipeline graph=1
    (screenshot, 2024-05-15: explain pipeline graph=1 output showing step and processor ids)

  • Add processor_uniq_id and step_uniq_id in processors_profile_log

  • Change Processor_id in opentelemetry_span_log
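
For reference, the "Node Id" fragment above comes from explain with json=1. A hypothetical statement of that form (the exact query that produced the fragment is not shown here; default.t1 is just the table from the earlier example, and FORMAT TSVRaw only keeps the JSON output unescaped):

    EXPLAIN json = 1
    SELECT * FROM default.t1
    FORMAT TSVRaw;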

nickitat previously approved these changes May 15, 2024

Review comment threads:
src/Interpreters/ProcessorsProfileLog.cpp (outdated)
src/Interpreters/ProcessorsProfileLog.h (outdated)
src/Interpreters/executeQuery.cpp (outdated)
src/QueryPipeline/QueryPipelineBuilder.cpp
src/Processors/QueryPlan/QueryPlan.h (outdated)
@@ -1336,7 +1336,14 @@ class Context: public ContextData, public std::enable_shared_from_this<Context>
std::shared_ptr<Clusters> getClustersImpl(std::lock_guard<std::mutex> & lock) const;

/// Throttling

size_t step_count = 0;
Member

let's not put it inside Context. e.g. it could be a static data member of IQueryPlanStep. the same for IProcessor

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting it in the Context makes the explain result more stable: every time we explain the same query we get the same IDs.
If we used static data, we could not get a stable result, so I put it in the Context.

Member

I think we'd better have unstable results than pollute the Context with these random counters.

Author

For the stateless test case 01786_explain_merge_tree, if the IDs are not stable, the result will not be fixed. So how do we fix this case? Just disable the json output?

Member

just disable json output

I guess it makes no difference to the test logic what output format we use

Author

I'm just worried about a background thread calling the plan. I will remove the json output.

Author

I think we'd better have unstable results than pollute the Context with these random counters.

Recently I've come to think a stable result is an important feature. If we worry about polluting the Context, how about putting a pointer to an int in CurrentThread::ThreadStatus? That would not pollute the Context and would still give a stable index.

Member

Recently I've come to think a stable result is an important feature

what value exactly do you see in it?

Author

It makes debugging easier.
When I find that some processors are slow, we can run an explain query to identify their steps; with a static counter that is hard to do when the query is complex.

If you think this is not important, I will change it to a static counter.
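
For illustration, the debugging workflow I have in mind would look roughly like this (processor_uniq_id and step_uniq_id are the columns proposed in this PR, and the query_id value is a placeholder):

    -- 1. find the slowest processors of a finished query
    SELECT processor_uniq_id, step_uniq_id, elapsed_us
    FROM system.processors_profile_log
    WHERE query_id = '<query_id>'
    ORDER BY elapsed_us DESC
    LIMIT 5;

    -- 2. run EXPLAIN on the same query text; with a per-query counter the step
    --    names (e.g. Join_6) are stable, so they match the step_uniq_id values above
    EXPLAIN SELECT * FROM t1 AS t JOIN t1 AS t2 ON t.a = t2.a WHERE t.a = 1;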
