UDF lineage? #181

Open
wajda opened this issue Mar 4, 2021 · 1 comment


wajda commented Mar 4, 2021

To be investigated.

A few points to note from our experience at Kensu Inc.

You might want to consider a few interesting special cases where part of the lineage may need to be provided semi-manually, e.g.:

  • Semi-manual lineage for UDFs. A UDF can be matched by ScalaUDF.function.getClass.getName:
    • extra UDF inputs that are local files (e.g. H2O predict via Spark pipelines)
    • optional annotations for lineage inside a UDF if needed; e.g. by default lineage assumes that all input fields of a UDF are connected to its output, but this could be clarified manually
  • Making it easy to plug in extra semi-manual/programmable lineage that connects the parts of the plan left unresolved by the default implementation with the external world; e.g. H2O contains H2OFrameRelation extends BaseRelation with TableScan, which accesses a remote H2oFrame that Spark does not see directly
  • The same points apply to RDDs
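To make the class-name matching idea concrete, here is a minimal sketch in plain Scala. It assumes the function object is reachable as described above; `UdfLineageRegistry`, `H2oPredict` and the model path are hypothetical names invented for illustration and are not part of Spline or H2O.

```scala
// Hypothetical registry: UDF class name -> extra inputs that Spark cannot see.
object UdfLineageRegistry {

  // A named class gives a stable class name to match on.
  // Anonymous lambdas get compiler-generated names, which are harder to match.
  class H2oPredict extends (Double => Double) with Serializable {
    def apply(x: Double): Double = x * 2.0 // stand-in for a real model call
  }

  // Hypothetical mapping maintained by the user (the path is made up).
  val extraInputs: Map[String, Seq[String]] = Map(
    classOf[H2oPredict].getName -> Seq("file:///models/example.zip")
  )

  // Looks up extra inputs the same way the comment suggests: by class name.
  def extraInputsFor(function: AnyRef): Seq[String] =
    extraInputs.getOrElse(function.getClass.getName, Seq.empty)
}
```

A lineage agent could call `extraInputsFor` with the function object it finds on the UDF expression and attach the returned locations as additional inputs. Unknown functions simply yield an empty list.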

Also, making it easier to customize things where needed would be great: avoid private/final methods.

Finally, struct field support probably needs some special care.

Originally posted by @vidma in AbsaOSS/spline#114 (comment)

wajda added the feature label Mar 4, 2021
wajda added this to the 1.0.0 milestone Jun 22, 2021
cerveada self-assigned this Oct 25, 2021

cerveada commented Nov 1, 2021

Current support for Scala UDFs in Spline

Scala UDFs are now captured and stored as expressions, and they seem to have correct inputs and outputs. There are not many details about the function itself; for example, a UDF that consists of several expressions is represented as a single UDF expression. From a lineage point of view this still seems to be a lot of detail, since even expression-level lineage is provided.

When a UDF is a pure function (using only its inputs to compute the output), everything seems to work already. The issue arises when the UDF uses other, external data: that data source is not captured.

Adding additional lineage info

Post Processing Filters seem to be a good solution to this problem. A Scala UDF can be selected by name using the information that is already captured.

Using an annotation or another form of marking the function seems unnecessary, since the name already identifies it. If needed, though, a wrapping function could be created that carries the additional info for Spline and simply delegates the call to the wrapped UDF.
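The wrapping idea could look roughly like the sketch below. It is plain Scala with no Spark or Spline dependency; `LineageHint`, `WithLineageHint` and the CSV path are hypothetical names chosen for illustration, not an existing API.

```scala
// Hypothetical metadata carrier; not part of Spline.
final case class LineageHint(externalSources: Seq[String])

// Wraps a one-argument function, exposes the hint, and delegates the call unchanged.
final class WithLineageHint[A, B](val hint: LineageHint, underlying: A => B)
    extends (A => B) with Serializable {
  override def apply(a: A): B = underlying(a)
}

object WithLineageHintExample {
  // Stand-in for external data that the function reads on its own.
  val lookup: Map[String, String] = Map("CZ" -> "Czechia")

  val countryName: String => String = code => lookup.getOrElse(code, "unknown")

  // The wrapper behaves like countryName but also carries the hint (path is made up).
  val wrapped = new WithLineageHint(
    LineageHint(Seq("file:///data/countries.csv")),
    countryName
  )
}
```

Because the wrapper is itself a function, it can be used wherever the original was. Code that inspects the function object can then test for `WithLineageHint` and read `hint`, assuming the function instance stays reachable from the captured UDF expression.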

Since we are able to find the function inside a filter, we can add expressions to it and modify it to better represent the actual lineage.
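As a rough illustration of what such a filter would do, the sketch below rewrites a toy expression tree, appending extra inputs to any UDF call matched by name. All types here (`Expr`, `UdfCall`, `ExternalRef`, and so on) are hypothetical and deliberately simplified; they are not Spline's data model or its filter interface.

```scala
// Toy model only; these types are hypothetical and are not Spline's data model.
sealed trait Expr
final case class AttrRef(name: String) extends Expr
final case class ExternalRef(uri: String) extends Expr
final case class UdfCall(name: String, children: Seq[Expr]) extends Expr
final case class Generic(name: String, children: Seq[Expr]) extends Expr

object UdfEnrichment {
  // Recursively rewrites the tree, appending extra inputs to UDF calls matched by name.
  def enrich(expr: Expr, extra: Map[String, Seq[String]]): Expr = expr match {
    case UdfCall(name, children) =>
      val added = extra.getOrElse(name, Seq.empty).map(ExternalRef(_))
      UdfCall(name, children.map(enrich(_, extra)) ++ added)
    case Generic(name, children) =>
      Generic(name, children.map(enrich(_, extra)))
    case leaf =>
      leaf // attribute and external references have no children to rewrite
  }
}
```

A real filter would apply the same kind of rewrite to the captured plan before it is sent, leaving every expression other than the matched UDF untouched.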

Adding another data source may be more complicated, since Spline expects all expression data to come from operations. Still, it can be done via a filter. (An example of the kind of external data used would help here.)

We could create some filters that handle the common UDF tasks and leave the user to add the extra info they want to provide. However, each such filter would be useful only for its intended use case, whereas the general filter that is already available can be modified by each user as they please.

Status: Backlog