
Support distributed ANALYZE #2374

Closed
waynexia opened this issue Sep 13, 2023 · 22 comments

@waynexia
Member

What problem does the new feature solve?

EXPLAIN ANALYZE only contains the execution result in the frontend and the gRPC time to each datanode. It works, but could be more detailed.

What does the feature do?

Also show the detailed ANALYZE result from each datanode, i.e., implement a distributed ANALYZE plan.

Implementation challenges

No response

@waynexia waynexia added the C-feature Category Features label Sep 13, 2023
@NiwakaDev
Collaborator

@waynexia

What kind of output do you expect?

For example, something like:

+----------------------------------------------------+
| component | plan_type         | plan               |
+----------------------------------------------------+
| frontend  | Plan with Metrics | RepartitionExec    |
+----------------------------------------------------+
| node1     | ~                 | ~                  |
+----------------------------------------------------+
| node2     | ~                 | ~                  |
+----------------------------------------------------+
| node3     | ~                 | ~                  |
+----------------------------------------------------+

@waynexia
Member Author

Considering the plan might be further distributed into more parts, I'd prefer to use a tuple (stage, node) to distinguish each part. At present we only have two stages: the first is executed on the datanodes and the second in the frontend. For your example, it would become something like the following.

+------------------------------------------------------------------------+
| stage   | node                | plan_type         | plan               |
+------------------------------------------------------------------------+
| stage 2 | node 1 (addr: ...)  | Plan with Metrics | RepartitionExec    |
+------------------------------------------------------------------------+
| stage 1 | node 1 (addr: ...)  | ~                 | ~                  |
+------------------------------------------------------------------------+
| stage 1 | node 2 (addr: ...)  | ~                 | ~                  |
+------------------------------------------------------------------------+
| stage 1 | node 3 (addr: ...)  | ~                 | ~                  |
+------------------------------------------------------------------------+

Some key points:

  • Node numbers are scoped to their stage and start from 1 in each stage.
  • Stages are counted bottom-up; the first one evaluated is stage 1.
  • Maybe we don't need to assign a node number to each node; the addr might be enough.

@NiwakaDev
Collaborator

NiwakaDev commented Oct 27, 2023

@waynexia

I'd prefer to use a tuple (stage, node) to distinguish each part.

I see. By the way, if the verbose option is set, I guess the output would be as follows:

+--------------------------------------------------------------------------------------------
| stage   | node                | plan_type         | plan               | output rows | ...
+--------------------------------------------------------------------------------------------
| stage 2 | node 1 (addr: ...)  | Plan with Metrics | RepartitionExec    | 2           | ...
+--------------------------------------------------------------------------------------------
| stage 1 | node 1 (addr: ...)  | ~                 | ~                  | 0           | ...
+--------------------------------------------------------------------------------------------
| stage 1 | node 2 (addr: ...)  | ~                 | ~                  | 1           | ...
+--------------------------------------------------------------------------------------------
| stage 1 | node 3 (addr: ...)  | ~                 | ~                  | 1           | ...
+--------------------------------------------------------------------------------------------

arrow-datafusion generates values like output rows as part of the plan_type column, but in our use case I guess it is better for the output to be formatted like the example immediately above, in order to avoid duplicating the (stage, node) pair.

@NiwakaDev
Collaborator

@waynexia

If you agree with the above format, I would like to work on this issue.

@waynexia
Member Author

waynexia commented Nov 3, 2023

Ahh, sorry for the delay.

The other parts look good to me, but things like output_rows are execution statistics, which might be missing. A simple workaround is to put all those statistics into a single string, if necessary.
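
A minimal sketch of that workaround, assuming we format DataFusion's MetricsSet directly (the helper name metrics_to_string is made up for illustration):

use datafusion::physical_plan::metrics::MetricsSet;

/// Collapse a plan node's metrics into a single string such as
/// "output_rows=2, elapsed_compute=8ns, ...". Statistics that were never
/// recorded simply don't appear in the string.
fn metrics_to_string(metrics: &MetricsSet) -> String {
    metrics
        .aggregate_by_name()   // merge metrics with the same name across partitions
        .sorted_for_display()  // stable, human-friendly ordering
        .timestamps_removed()  // drop raw start/end timestamps
        .to_string()           // MetricsSet implements Display
}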

@waynexia
Member Author

Hi @NiwakaDev, just a friendly ping. Do you have an initial plan or a rough structure for this? I was wondering if you would like to discuss any undetermined points or questions.

@NiwakaDev
Collaborator

NiwakaDev commented Dec 18, 2023

@waynexia

Do you have an initial plan or a rough structure for this? I was wondering if you would like to discuss any undetermined points or questions.

Here's an initial plan:

  1. Send DistributedAnalyzePlan (a custom logical plan) to each datanode.
  2. Execute DistributedAnalyzeExec on each datanode side like https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/analyze.rs#L143-L247.
  3. Integrate each result on the frontend side like MergeScan.

DistributedAnalyzeExec outputs:
The difference from the normal AnalyzeExec is that there are two types of output:

  1. the data batches from input_stream.next()
  2. the datanode's analyze result, like AnalyzeExec (https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/analyze.rs#L201-L247)

While AnalyzeExec ignores the batches from input_stream.next(), in DistributedAnalyzeExec I guess the frontend needs both outputs to construct the result above (#2374 (comment)).

I haven't yet come up with a solution for how to send both the input_stream.next() batches and the AnalyzeExec-like output from each datanode to the frontend.
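
To make the two kinds of output concrete, here is a rough sketch (the enum name is hypothetical, not an actual type):

use datafusion::arrow::record_batch::RecordBatch;

/// What a datanode-side DistributedAnalyzeExec would have to forward to the
/// frontend, unlike the upstream AnalyzeExec, which drops the data batches
/// and only reports metrics.
enum DistributedAnalyzeOutput {
    /// A data batch obtained by polling the wrapped input stream.
    Data(RecordBatch),
    /// The datanode's rendered plan-with-metrics report, like AnalyzeExec's output.
    Metrics(String),
}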

@waynexia
Member Author

waynexia commented Dec 18, 2023

Thanks for your thoughtful investigation 👍

I have one concern about passing intermediate metrics (those rendered in AnalyzeExec): they are transformed into string literals and become hard to reuse in later phases like aggregate or filter.

Thus I came up with another way: encode and transfer the metrics (specifically, MetricsSet in datafusion) along with the data for each query, and report them to the user on demand, just like how an AnalyzeExec works in a single instance: drop the data and report the per-plan metrics.

For transferring metrics together with data, I've submitted a PR to add the corresponding fields to the proto file: GreptimeTeam/greptime-proto#130 (if we decide to go this way, we can define some general metric types in the proto message instead of a string-string map).

Then we don't need a DistributedAnalyzePlan for the datanodes, only an uppermost, customized AnalyzeExec to extract and render the "distributed" metrics.

@waynexia
Member Author

One thing I haven't figured out is how we handle different metrics from different nodes. In datafusion, those metrics are attached to the plan itself. E.g., a join plan has two children, and each child can keep its own metrics. But here we don't have actual child nodes besides MergeScan. So how can we keep the tree structure for the metrics from them?

If we don't need to distinguish metrics from each node, we can aggregate them into one sub-tree in MergeScan, since they are going to have the same physical plan. But if what we want is the per-phase and per-node analysis above, this doesn't help.

@NiwakaDev
Collaborator

NiwakaDev commented Dec 19, 2023

@waynexia

One thing I haven't figured out is how we handle different metrics from different nodes. In datafusion, those metrics are attached to the plan itself. E.g., a join plan has two children, and each child can keep its own metrics. But here we don't have actual child nodes besides MergeScan. So how can we keep the tree structure for the metrics from them?

Is the issue you described related to the code below? Sorry, I might be wrong because I'm not familiar with datafusion.
https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/display.rs#L243-L253
https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/visitor.rs#L24-L34

Thus I came up with another way: encode and transfer the metrics (specifically, MetricsSet in datafusion) along with the data for each query, and report them to the user on demand, just like how an AnalyzeExec works in a single instance: drop the data and report the per-plan metrics.

If we don't need per-node analysis, I agree with this.

@NiwakaDev
Collaborator

@waynexia

By the way, if we implement your idea, what kind of output do you expect?

@waynexia
Member Author

One thing I haven't figured out is how we handle different metrics from different nodes. In datafusion, those metrics are attached to the plan itself. E.g., a join plan has two children, and each child can keep its own metrics. But here we don't have actual child nodes besides MergeScan. So how can we keep the tree structure for the metrics from them?

Is the issue you described related to the code below? Sorry, I might be wrong because I'm not familiar with datafusion.

fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
    vec![]
}

MergeScanExec is a leaf node in the frontend's execution plan tree. It submits part of the query to the datanodes and merges the results from them (hence "merge scan"). Since that part of the plan is not executed in the frontend, MergeScanExec has neither children nor their metrics. This prevents the visitor method you mentioned above from walking into MergeScanExec.

We can retrieve metrics from the datanodes, but for this reason I'm afraid we have to keep and access those "remote metrics" in a different way.
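
To illustrate that "different way", here is one hypothetical shape for those remote metrics, kept beside MergeScanExec rather than on child plan nodes (the names and fields are assumptions, not the actual implementation):

use std::collections::HashMap;

/// Metrics reported back by the datanodes. Since the remote sub-plans are not
/// children of the frontend plan tree, the usual metrics visitor cannot reach
/// them, so they are stored and queried separately.
#[derive(Debug, Default)]
struct RemoteMetrics {
    /// Datanode address -> rendered plan-with-metrics text from that node.
    per_node: HashMap<String, String>,
}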

By the way, if we implement your idea, what kind of output do you expect?

I would like to have two forms. One is distinguished by the tuple (phase, node), where we can retrieve un-aggregated, per-phase and per-node metrics for detailed analysis. The other may look closer to the ordinary ANALYZE, which hides the details of distributed execution and gives a rough, aggregated result.

@waynexia
Member Author

Update: at this stage, it is clear that we have to find a way to pass data and execution metrics together in the same query call. @shuiyisong and I are trying to add a method to SendableRecordBatchStream to provide the corresponding execution metrics. But it's still undetermined how to define, organize and expose those metrics.
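
A rough sketch of that idea as an extension trait (the trait and method names are hypothetical; the real API was still undetermined at this point):

use datafusion::physical_plan::metrics::MetricsSet;

/// Let a record batch stream expose the execution metrics gathered while it
/// produced its data, so the frontend can collect them after the scan finishes.
trait StreamMetrics {
    /// Metrics recorded so far, if the producer kept any.
    fn metrics(&self) -> Option<MetricsSet>;
}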

I haven't yet come up with a solution for how to send both the input_stream.next() batches and the AnalyzeExec-like output from each datanode to the frontend.

We can assume this issue is resolved (if everything works as expected...) and bring this ticket forward 🙌 @NiwakaDev

@NiwakaDev
Collaborator

NiwakaDev commented Dec 24, 2023

@waynexia

If we don't need to distinguish metrics from each node, we can aggregate them into one sub-tree in MergeScan, since they are going to have the same physical plan.

Do we choose to "aggregate them into one child plan in MergeScan", rather than one plan per node?

Something like:

| Plan with Metrics | MergeScanExec: peers=[4398046511104(1024, 0), ], metrics=[output_rows=2, ready_time=180.582083ms, first_consume_time=180.884833ms, finish_time=180.984ms]
|   		    |    ProjectionExec: expr=[name@0 as name], metrics=[output_rows=0, elapsed_compute=8ns]                                                           
|                   |     CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=2.499µs]                                                
|                   |       FilterExec: value@1 = 10, metrics=[output_rows=0, elapsed_compute=653.341µs]                                                               
|                   |         RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[fetch_time=154.749µs, repart_time=1ns, send_time=6.292µs] 
~                 

or

| Plan with Metrics | MergeScanExec: ~, metrics=[output_rows=2, ~, datanode1_metrics=~, datanode2_metrics=~]
~                 

@waynexia
Member Author

waynexia commented Dec 26, 2023

What do you think about distinguishing them with the VERBOSE keyword? E.g.:

EXPLAIN ANALYZE <QUERY> gives the aggregated result:

| Plan with Metrics | MergeScanExec: peers=[4398046511104(1024, 0), ], metrics=[output_rows=2, ready_time=180.582083ms, first_consume_time=180.884833ms, finish_time=180.984ms]
|   		    |    ProjectionExec: expr=[name@0 as name], metrics=[output_rows=0, elapsed_compute=8ns]                                                           
|                   |     CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=2.499µs]                                                
|                   |       FilterExec: value@1 = 10, metrics=[output_rows=0, elapsed_compute=653.341µs]                                                               
|                   |         RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[fetch_time=154.749µs, repart_time=1ns, send_time=6.292µs] 
~                 

and EXPLAIN ANALYZE VERBOSE <QUERY> gives the un-aggregated result:

+------------------------------------------------------------------------+
| stage   | node                | plan_type         | plan               |
+------------------------------------------------------------------------+
| stage 2 | node 1 (addr: ...)  | Plan with Metrics | RepartitionExec    |
+------------------------------------------------------------------------+
| stage 1 | node 1 (addr: ...)  | ~                 | ~                  |
+------------------------------------------------------------------------+
| stage 1 | node 2 (addr: ...)  | ~                 | ~                  |
+------------------------------------------------------------------------+
| stage 1 | node 3 (addr: ...)  | ~                 | ~                  |
+------------------------------------------------------------------------+

@NiwakaDev
Collaborator

NiwakaDev commented Dec 28, 2023

@waynexia

What do you think about distinguishing them by the VERBOSE word?

Looks good to me, but as you said, I guess we need to think of another approach to implement that.

If we don't need to distinguish metrics from each node, we can aggregate them into one sub-tree in MergeScan, since they are going to have the same physical plan. But if what we want is the per-phase and per-node analysis above, this doesn't help.

Maybe we need to divide it into two logical plans for the two purposes: one is the non-verbose plan, the other is the verbose plan.
As you said, I think we can implement the non-verbose plan first.

@waynexia
Member Author

Yes, I'm afraid the built-in ANALYZE is not aware of this logic. We have to make a new one for the distributed case.

Maybe we need to divide it into two logical plans for the two purposes: one is the non-verbose plan, the other is the verbose plan.
As you said, I think we can implement the non-verbose plan first.

Looks good to me! 👍 I guess we've found answers to all the problems we had before?

@NiwakaDev
Collaborator

Sorry for the late reply.

I guess we've found answers to all the problems we had before?

Yes! I'll review the above PR tonight and tomorrow. Sorry for the late reply again.

@waynexia
Member Author

Don't worry! I hope you had a nice New Year holiday 🎉

@NiwakaDev
Collaborator

NiwakaDev commented Jan 14, 2024

@waynexia
Since we need to have the plan tree of each datanode under MergeScanExec, the JSON format (#3113) of each datanode query needs to be something like:

{
    "name": "ProjectionExec",
    "metrics": {
        "total_num": 0,
        ~
    },
    "children": [
        {
            "name": "CoalesceBatchesExec",
            "metrics": {~},
            "children": [
                ~
            ]
        }
    ]
}

After the JSON format discussion lands, I'll write a rough implementation of the non-verbose plan based on #3113.
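
For illustration only, a possible Rust shape for one node of that JSON tree (field types are assumptions; metric values may well need something richer than integers):

use std::collections::HashMap;

use serde::{Deserialize, Serialize};

/// One node of a datanode's plan tree, carrying its own metrics and its
/// children, ready to be hung under MergeScanExec on the frontend.
#[derive(Debug, Serialize, Deserialize)]
struct PlanNode {
    name: String,
    /// Metric name -> value, e.g. "total_num" -> 0.
    #[serde(default)]
    metrics: HashMap<String, u64>,
    /// Child plan nodes; empty for leaves.
    #[serde(default)]
    children: Vec<PlanNode>,
}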

@waynexia
Member Author

#3113 is merged, and there are some small changes to the plan metrics after that. Now we pass the corresponding physical (execution) plan together with the result RecordBatchStream. This might make it easier to implement this task.

@NiwakaDev
Collaborator

@waynexia
I apologize for the delayed response. I guess we can close this issue. I'll find another DataFusion issue.
