
Attribute lineage #114

Closed
wajda opened this issue Feb 6, 2019 · 13 comments


wajda commented Feb 6, 2019

No description provided.

wajda added the Epic label Feb 6, 2019

vidma commented Feb 7, 2019

Hi, @wajda ,

What we want is the ability to say which output dataset columns depend on which columns of the input datasets (and also to know whether that dependency is data or control lineage).

What's the state of attribute / per-column lineage in Spline? It seems that it's not supported by design?

In Spline, some operations have Expressions attached to them; however, they do not provide any clear relationship showing which output columns were affected by which operations. (FYI: Spark query plans have this information available as an operation's output expression IDs (exprId), which are easily mappable to output attribute names; we use this in our in-house Spark query plan parser at kensu.io to obtain column lineage, and it works pretty well.)

Are there any plans to add this?

P.S. There might be a hacky way to parse existing Spline operations to obtain limited column lineage; however, it doesn't seem very reliable (and it seems it might not catch all column lineage...):

Data lineage: operations like

  • Projection (expect all Projection entries to be in the form expr.Alias(outputAttributeName, ...))
  • Aggregate (expect aggregate output column names to be mappable to output dataset attribute IDs)

Control lineage: operations like Join, Filter, Sort (add control lineage to all output "attributes")

It would be much better if each operation clearly defined which output attributes depend on which input data.
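The hacky mapping described above could be sketched roughly as follows. This is a toy model with made-up names, not Spline's or Kensu's actual parser: attributes are keyed by a stable id (like Spark's exprId), projection expressions yield data lineage from the attributes they read, and a filter contributes control lineage to every output attribute.

```python
from dataclasses import dataclass

DATA, CONTROL = "data", "control"

@dataclass(frozen=True)
class Attr:
    id: int     # stable identifier, analogous to Spark's exprId
    name: str

def projection_lineage(aliases):
    """aliases: {output Attr: set of input Attrs its expression reads}."""
    return {out: {(dep, DATA) for dep in deps} for out, deps in aliases.items()}

def filter_lineage(condition_attrs, outputs):
    """Every output attribute gets control lineage from the filter condition."""
    return {out: {(dep, CONTROL) for dep in condition_attrs} for out in outputs}

a, b, c = Attr(1, "a"), Attr(2, "b"), Attr(3, "c")
# roughly: SELECT a + b AS c ... WHERE a > 0
lin_data = projection_lineage({c: {a, b}})
lin_ctrl = filter_lineage({a}, [c])
```

In this sketch `c` carries data lineage from `a` and `b`, and control lineage from `a` via the filter condition — which is exactly the distinction the heuristic above tries to recover from the plan.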


wajda commented Feb 8, 2019

Hi @vidma,
Yes, we do have it in the roadmap, and it has priority. Spline 0.3 kind of supports it at the UI level (if you click on an attribute, it will highlight the path in the DAG where this attribute comes from). The truth is that this feature is quite limited and is based on attribute names rather than IDs. However, in the next generation of Spline that we are currently working on, we want not only to make attribute lineage work reliably within the scope of a single execution plan; we'll also try to provide cross-job attribute lineage.
Unfortunately, we've got a lot of things on our plate at the moment, so I cannot say when it will be implemented, but it definitely will be. Stay tuned.


vidma commented Feb 8, 2019

Spline 0.3 kind of supports it at the UI level (if you click on an attribute, it will highlight the path in the DAG where this attribute comes from). The truth is that this feature is quite limited and is based on attribute names rather than IDs

I would like to hear more about how it's implemented (maybe you could point me to the UI code that finds the attribute lineage) and how limited it is.

I guess if the query had multiple columns with the same name (e.g. coming from two different datasets that are joined), or complex expressions with unnamed "attributes" in them, it might sometimes fail?

@wajda


wajda commented Feb 8, 2019

Sorry, I was wrong saying that it's based on attribute names (that used to be the case in earlier versions). In 0.3.6 it is actually based on attribute IDs, just like in Spark. Basically, what Spline does is simply take Spark attributes and convert them to Spline ones one by one, as well as the operations. So if some operations share the same attribute, the Spline UI will simply highlight those operations.
See https://github.com/AbsaOSS/spline/blob/master/core/src/main/scala/za/co/absa/spline/core/harvester/componentCreators.scala AttributeConverter
and then
https://github.com/AbsaOSS/spline/blob/master/web/ui/src/app/lineage/lineage.store.ts getOperationIdsByAnyAttributeId()
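As a toy illustration of why ID-based matching (as in 0.3.6) is more robust than the earlier name-based approach, consider a join that brings in two columns both named "id". This is illustrative code, not Spline's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attr:
    id: int     # stable identifier, analogous to Spark's exprId
    name: str

left_id = Attr(1, "id")    # "id" column from dataset A
right_id = Attr(2, "id")   # "id" column from dataset B
out = Attr(1, "id")        # the join output actually propagates A's column

def match_by_name(out, inputs):
    # Ambiguous: both join inputs share the name "id".
    return [a for a in inputs if a.name == out.name]

def match_by_id(out, inputs):
    # Unambiguous: exactly one input carries the same stable id.
    return [a for a in inputs if a.id == out.id]
```

Name-based matching returns both candidates and cannot tell which dataset the output column actually came from; ID-based matching pinpoints the single source.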


vidma commented Feb 8, 2019

So if some operations share the same attribute, the Spline UI will simply highlight those operations

It would be more interesting to highlight the end-to-end lineage from input dataset attributes to output dataset attributes (which may be multiple complex operations apart). As I understand it, this is not available yet for the reasons/issues mentioned earlier?


wajda commented Feb 11, 2019

No, that's not available yet, but we have a plan to eventually get there.


vidma commented Oct 23, 2019

Hi,

Any update on this? Is per-attribute lineage supported already, or are there any plans for it?


wajda commented Oct 23, 2019

Hi,
No updates so far; it's still in the backlog. We'll get to it when we have a chance.

wajda added this to the 0.5.0 milestone Dec 4, 2019
wajda modified the milestones: 0.5.0, 0.6.0 Mar 3, 2020

cerveada commented May 5, 2020

New data model brainstorming output:
spline-brainstroming2.jpg


cerveada commented Oct 8, 2020

20200520_100021


vidma commented Oct 8, 2020

A few points to note from our experience at Kensu Inc.

You might want to consider a few interesting special cases where part of the lineage may need to be provided semi-manually, e.g.:

  • Semi-manual lineage for UDFs; a UDF can be matched by ScalaUDF.function.getClass.getName:
    • extra UDF inputs that are local files (e.g. H2O predict via Spark pipelines)
    • optional annotations for the lineage inside a UDF, if needed; e.g. by default lineage assumes that all input fields of the UDF are connected to the UDF's output, but this could be manually clarified
  • Making it easy to plug in extra semi-manual/programmable lineage to connect with the external world those parts of the plan that may be unresolved by the default implementation, e.g. H2O contains H2OFrameRelation extends BaseRelation with TableScan, which accesses a remote H2OFrame not seen by Spark directly
  • The same points apply to RDDs

Also, making it easier to customize things when needed would help: avoiding private/final methods would be great.

Finally, struct field support probably needs some special care.


wajda commented Mar 4, 2021

Almost there.
The majority of the work has (finally) been done in #822. It took us quite a while, but better late than never :)
We had to extend our data model (it's now about twice the size of the previous one). Attribute lineage is solved similarly to operation lineage: attributes are made separate entities and stored in a DAG, but linked with another type of edge. Basically, the relations between attributes form an additional dimension of the ultimate lineage graph. We did the same for the expression graph, which connects attributes and adds another level of detail. So we currently have 4 layers in the model, where every next layer extends the previous one.
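A rough, hypothetical sketch of the attribute layer of such a model (all names here are invented for illustration; the real model lives in #822): attributes form their own graph with a dedicated edge type, and a transitive walk over those edges yields per-attribute lineage across operations.

```python
from dataclasses import dataclass, field

@dataclass
class LineageGraph:
    # Layer 1: operation DAG (operation id -> child operation ids)
    op_edges: dict = field(default_factory=dict)
    # Layer 2: attribute DAG with its own edge type
    # (attribute id -> ids of the attributes it derives from)
    attr_edges: dict = field(default_factory=dict)

    def upstream_attrs(self, attr_id):
        """All attributes this attribute transitively derives from."""
        seen = set()
        stack = list(self.attr_edges.get(attr_id, ()))
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(self.attr_edges.get(cur, ()))
        return seen

g = LineageGraph(
    op_edges={"write": ["project"], "project": ["read"]},
    attr_edges={"total": ["price", "qty"], "price": ["raw_price"]},
)
```

Here asking for the upstream of "total" walks through "price" to its raw source, independently of how many operations sit in between — the same idea at attribute granularity that the operation DAG provides at operation granularity.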

There is one outstanding issue however that we'll be addressing in the future releases - #791 - data lineage for Aggregate operations.

@vidma, thanks again for your input. Please let us know if this is what you expected. I'm really interested to hear your feedback.

(data model diagram attached)

Doc location: https://github.com/AbsaOSS/spline/tree/gh-pages/docs


wajda commented Mar 4, 2021

As for RDD lineage, there is a separate issue in the spark-agent repo - AbsaOSS/spline-spark-agent#33

The UDF case is worth investigating; I created another issue for that - AbsaOSS/spline-spark-agent#181
