-
Hey there, thanks in advance!
Replies: 2 comments 2 replies
-
Hi, the API supports an option to have one. The Spark agent doesn't do this right now, but the API is prepared for it.
-
That's exactly the point - an execution plan represents the static model of the pipeline, sort of a math formula. You can see it as a version of the pipeline. Only when you execute it does data flow through it, and it's the lineage of that data that is tracked. There is no logical value in recording the same execution plan over and over again; it only creates unnecessary clutter in the database and makes the cumulative lineage hard to analyze. This becomes especially noticeable with "append" writes, where the full lineage of a given file's content is presented as the union of all past append writes since the last overwrite. If every write came with its own execution plan, the resulting graph would be a mess.

Unfortunately, at the moment the agent cannot reuse or de-duplicate execution plans automatically. We are aware of this limitation and will likely introduce some label-based versioning mechanism to overcome this problem, at least for cases where the structure of the data pipelines doesn't change frequently.
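To make the de-duplication idea concrete, here is a minimal sketch (not Spline's actual API; all names are illustrative) of how a lineage store could key execution plans by a stable content hash, so that repeated runs of the same pipeline reference one plan while each execution event is still recorded separately:

```python
import hashlib
import json

class LineageStore:
    """Toy lineage store that de-duplicates execution plans by content hash.

    Hypothetical illustration only; Spline's real model and API differ.
    """

    def __init__(self):
        self.plans = {}        # plan_id -> plan structure (stored once)
        self.executions = []   # (plan_id, event) for every run, incl. appends

    def record(self, plan, event):
        # A stable hash of the plan structure serves as its identity,
        # so re-running an unchanged pipeline reuses the stored plan.
        plan_id = hashlib.sha256(
            json.dumps(plan, sort_keys=True).encode()
        ).hexdigest()[:12]
        self.plans.setdefault(plan_id, plan)
        self.executions.append((plan_id, event))
        return plan_id

store = LineageStore()
plan = {"reads": ["s3://in"], "writes": "s3://out", "mode": "append"}
first = store.record(plan, {"ts": 1})
second = store.record(plan, {"ts": 2})  # same plan, second append write
assert first == second
assert len(store.plans) == 1       # one plan stored, no clutter
assert len(store.executions) == 2  # both runs tracked for lineage
```

With this layout, the cumulative lineage of an append-written file is a union of execution events all pointing at one plan node, rather than a graph containing a duplicate plan per write.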