
Attribute lineage #114

Closed
wajda opened this issue Feb 6, 2019 · 13 comments


wajda commented Feb 6, 2019

No description provided.

wajda added the Epic label Feb 6, 2019

vidma commented Feb 7, 2019

Hi, @wajda ,

What we want is the ability to say which output dataset columns depend on which columns of the input datasets (and also to know whether that dependency is data or control lineage).

What's the state of attribute / per-column lineage in Spline? It seems that it's not supported by design?

In Spline, some operations have Expressions attached to them; however, they do not provide any clear relationship showing which output columns were affected by which operations. (FYI: Spark query plans have this information available as an operation's output expression IDs (exprId), which are easily mappable to output attribute names; we use this in our in-house Spark query plan parser at kensu.io to obtain column lineage, and it works pretty well.)

Are there any plans to add this?

P.S. There might be a hacky way to parse existing Spline operations to obtain limited column lineage; however, it doesn't seem very reliable (and it seems it might not catch all column lineage...):

Data lineage: operations like

  • Projection (expect all Projection entries to be in the form expr.Alias(outputAttributeName, ...))
  • Aggregate (expect aggregate output column names to be mappable to output dataset attribute IDs)

Control lineage: operations like Join, Filter, Sort (add control lineage to all output "attributes")

It would be much better if each operation clearly defined which output attributes depend on which input data.
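The hacky mapping described above could be sketched roughly as follows. This is a toy model with made-up names, not Spline's or Kensu's actual parser: attributes are keyed by a stable id (like Spark's exprId), projection expressions yield data lineage from the attributes they read, and a filter contributes control lineage to every output attribute.

```python
from dataclasses import dataclass

DATA, CONTROL = "data", "control"

@dataclass(frozen=True)
class Attr:
    id: int     # stable identifier, analogous to Spark's exprId
    name: str

def projection_lineage(aliases):
    """aliases: {output Attr: set of input Attrs its expression reads}."""
    return {out: {(dep, DATA) for dep in deps} for out, deps in aliases.items()}

def filter_lineage(condition_attrs, outputs):
    """Every output attribute gets control lineage from the filter condition."""
    return {out: {(dep, CONTROL) for dep in condition_attrs} for out in outputs}

a, b, c = Attr(1, "a"), Attr(2, "b"), Attr(3, "c")
# roughly: SELECT a + b AS c ... WHERE a > 0
lin_data = projection_lineage({c: {a, b}})
lin_ctrl = filter_lineage({a}, [c])
```

In this sketch `c` carries data lineage from `a` and `b`, and control lineage from `a` via the filter condition — which is exactly the distinction the heuristic above tries to recover from the plan.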


wajda commented Feb 8, 2019

Hi @vidma,
Yes, we do have it in the roadmap, and it has priority. Spline 0.3 kind of supports it at the UI level (if you click on an attribute, it will highlight the path in the DAG where this attribute comes from). The truth is that this feature is quite limited and is based on attribute names rather than IDs. However, in the next generation of Spline that we are currently working on, we want not only to make attribute lineage work reliably within the scope of a single execution plan; we'll also try to provide cross-job attribute lineage.
Unfortunately, we've got a lot of things on our plate at the moment, so I cannot say when it will be implemented, but it definitely will be. Stay tuned.


vidma commented Feb 8, 2019

Spline 0.3 kind of supports it at the UI level (if you click on an attribute, it will highlight the path in the DAG where this attribute comes from). The truth is that this feature is quite limited and is based on attribute names rather than IDs

I would like to hear more about how it's implemented (maybe you could point me to the UI code that finds the attribute lineage) and how limited it is.

I guess if the query had multiple columns with the same name (e.g. coming from two different datasets that are joined), or complex expressions with unnamed "attributes" in them, it might sometimes fail?

@wajda


wajda commented Feb 8, 2019

Sorry, I was wrong saying that it's based on attribute names (that used to be the case in earlier versions). In 0.3.6 it is actually based on attribute IDs, just like in Spark. Basically, what Spline does is simply take Spark attributes and convert them to Spline ones one by one, as well as the operations. So if some operations share the same attribute, the Spline UI will simply highlight those operations.
See https://github.com/AbsaOSS/spline/blob/master/core/src/main/scala/za/co/absa/spline/core/harvester/componentCreators.scala AttributeConverter
and then
https://github.com/AbsaOSS/spline/blob/master/web/ui/src/app/lineage/lineage.store.ts getOperationIdsByAnyAttributeId()
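As a toy illustration of why ID-based matching (as in 0.3.6) is more robust than the earlier name-based approach, consider a join that brings in two columns both named "id". This is illustrative code, not Spline's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attr:
    id: int     # stable identifier, analogous to Spark's exprId
    name: str

left_id = Attr(1, "id")    # "id" column from dataset A
right_id = Attr(2, "id")   # "id" column from dataset B
out = Attr(1, "id")        # the join output actually propagates A's column

def match_by_name(out, inputs):
    # Ambiguous: both join inputs share the name "id".
    return [a for a in inputs if a.name == out.name]

def match_by_id(out, inputs):
    # Unambiguous: exactly one input carries the same stable id.
    return [a for a in inputs if a.id == out.id]
```

Name-based matching returns both candidates and cannot tell which dataset the output column actually came from; ID-based matching pinpoints the single source.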


vidma commented Feb 8, 2019

So if some operations share the same attribute, the Spline UI will simply highlight those operations

It would be more interesting to highlight the end-to-end lineage from input dataset attributes to output dataset attributes (which may be multiple complex operations apart). As I understand it, this is not available yet for the reasons/issues mentioned earlier?


wajda commented Feb 11, 2019

No, that's not available yet, but we have a plan to eventually get there.


vidma commented Oct 23, 2019

Hi,

Any update on this? Is per-attribute lineage supported already, or are there any plans for it?


wajda commented Oct 23, 2019

Hi,
No updates so far; it's still in the backlog. We'll get to it when we have a chance.

wajda added this to the 0.5.0 milestone Dec 4, 2019
wajda modified the milestones: 0.5.0, 0.6.0 Mar 3, 2020

cerveada commented May 5, 2020

New data model brainstorming output:
spline-brainstroming2.jpg


cerveada commented Oct 8, 2020

20200520_100021


vidma commented Oct 8, 2020

A few points to note from our experience at Kensu Inc.

You might want to consider a few interesting special cases where part of the lineage may need to be provided semi-manually, e.g.:

  • Semi-manual lineage for UDFs; a UDF can be matched by ScalaUDF.function.getClass.getName:
    • extra UDF inputs that are local files (e.g. H2O predict via Spark pipelines)
    • optional annotations for the lineage inside a UDF, if needed; e.g. by default lineage assumes that all input fields of the UDF are connected to the UDF's output, but this could be manually clarified
  • Making it easy to plug in extra semi-manual/programmable lineage to connect with the external world those parts of the plan that may be unresolved by the default implementation, e.g. H2O contains H2OFrameRelation extends BaseRelation with TableScan, which accesses a remote H2OFrame not seen by Spark directly
  • The same points apply to RDDs

Also, making it easier to customize things when needed would help: avoiding private/final methods would be great.

Finally, struct field support probably needs some special care.


wajda commented Mar 4, 2021

Almost there.
The majority of the work has (finally) been done in #822. It took us quite a while, but better late than never :)
We had to extend our data model (it's now about twice the size of the previous one). Attribute lineage is solved similarly to operation lineage: attributes are made separate entities and stored in a DAG, but linked with another type of edge. Basically, the relations between attributes form an additional dimension of the ultimate lineage graph. We did the same for the expression graph, which connects attributes and adds another level of detail. So we currently have 4 layers in the model, where every next layer extends the previous one.
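A rough, hypothetical sketch of the attribute layer of such a model (all names here are invented for illustration; the real model lives in #822): attributes form their own graph with a dedicated edge type, and a transitive walk over those edges yields per-attribute lineage across operations.

```python
from dataclasses import dataclass, field

@dataclass
class LineageGraph:
    # Layer 1: operation DAG (operation id -> child operation ids)
    op_edges: dict = field(default_factory=dict)
    # Layer 2: attribute DAG with its own edge type
    # (attribute id -> ids of the attributes it derives from)
    attr_edges: dict = field(default_factory=dict)

    def upstream_attrs(self, attr_id):
        """All attributes this attribute transitively derives from."""
        seen = set()
        stack = list(self.attr_edges.get(attr_id, ()))
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(self.attr_edges.get(cur, ()))
        return seen

g = LineageGraph(
    op_edges={"write": ["project"], "project": ["read"]},
    attr_edges={"total": ["price", "qty"], "price": ["raw_price"]},
)
```

Here asking for the upstream of "total" walks through "price" to its raw source, independently of how many operations sit in between — the same idea at attribute granularity that the operation DAG provides at operation granularity.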

There is one outstanding issue however that we'll be addressing in the future releases - #791 - data lineage for Aggregate operations.

@vidma, thanks again for your input. Please let us know if this is what you expected. I'm really interested to hear your feedback.

(data model diagram attached)

Doc location: https://github.com/AbsaOSS/spline/tree/gh-pages/docs


wajda commented Mar 4, 2021

As for RDD lineage, there is a separate issue in the spark-agent repo - AbsaOSS/spline-spark-agent#33

The UDF case is worth investigating; I created another issue for that - AbsaOSS/spline-spark-agent#181
