Support attribute and expression level lineage for MapElements, SerializeFromObject, DeserializeToObject #342

Open
cerveada opened this issue Oct 26, 2021 · 9 comments

@cerveada
Contributor

Lineages generated from the code shown in discussions/341 are missing the connections between attributes and expressions.

Let's add support for that.

@carpe-erin

@cerveada what do you think is the priority on getting this implemented?

@wajda
Contributor

wajda commented Nov 2, 2021

MapElements is created from the RDD-style method map(fn), where fn is a lambda, so no attribute lineage can be inferred automatically.

We might try to solve it with custom annotations, for instance on the model case classes, to carry the missing compile-time information over to the runtime, but this isn't something that can be done quickly. I'd say this is a nice feature request that could be addressed in the scope of solving the RDD lineage gaps in #33.

@wajda
Contributor

wajda commented Nov 2, 2021

I didn't quite get the part about SerializeFromObject and DeserializeToObject. What is missing there?

@wajda wajda added this to the 1.1.0 milestone Nov 2, 2021
@carpe-erin

The lineage of the example in discussion #341 currently contains 5 operations:

  1. LogicalRelation
  2. DeserializeToObject
  3. MapElements
  4. SerializeFromObject
  5. InsertIntoHadoopFsRelationCommand

To be honest, I am not too certain about the details of what occurs in DeserializeToObject and SerializeFromObject. When I dug into the collections in ArangoDB, my biggest issue was trying to find a connection from the fields in MapElements to the output of SerializeFromObject. The argumentSchema field on the MapElements operation in the operation collection gave me an idea of what was in the obj, but I couldn't find anything in the expression collection that told me how those fields mapped to SerializeFromObject.

@wajda
Contributor

wajda commented Nov 5, 2021

That's because the connections between fields in MapElements aren't visible at runtime, neither to Spline nor to Spark. The transformation happens in a lambda function, and the only thing we know about it at runtime is that it takes one object as input and returns another object as output. How exactly the fields of that object are computed is shrouded in the darkness of bytecode.
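
To illustrate the point, here is a minimal plain-Java sketch (no Spark or Spline involved) of what a runtime observer can learn about a lambda through reflection. The `Person` and `Summary` classes and both mapper functions are hypothetical, made up for this example. The two lambdas use different input fields, yet the shape they expose at runtime is identical.

```java
import java.lang.reflect.Method;
import java.util.function.Function;

class OpaqueLambdaDemo {
    // Hypothetical input and output models, invented for this illustration.
    static final class Person {
        final String firstName;
        final String lastName;
        final int age;

        Person(String firstName, String lastName, int age) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.age = age;
        }
    }

    static final class Summary {
        final String label;
        final int value;

        Summary(String label, int value) {
            this.label = label;
            this.value = value;
        }
    }

    // Two mappers with different field-level logic.
    static final Function<Person, Summary> BY_FIRST_NAME =
            p -> new Summary(p.firstName, p.age);
    static final Function<Person, Summary> BY_LAST_NAME =
            p -> new Summary(p.lastName, p.age * 2);

    // Describes the only thing reflection reveals about the function object:
    // an erased apply method that takes one object and returns another.
    static String visibleSignature(Function<?, ?> fn) {
        try {
            Method apply = fn.getClass().getMethod("apply", Object.class);
            return apply.getName()
                    + "(" + apply.getParameterTypes()[0].getName() + "): "
                    + apply.getReturnType().getName();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both lines print: apply(java.lang.Object): java.lang.Object
        System.out.println(visibleSignature(BY_FIRST_NAME));
        System.out.println(visibleSignature(BY_LAST_NAME));
    }
}
```

Nothing in that signature says which input field feeds which output field, so the mapping would have to be recovered from the compiled lambda body itself.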

@carpe-erin

@wajda Ah okay. So it sounds like it isn't possible to create this feature?

@wajda
Contributor

wajda commented Nov 5, 2021

Well, it's practically impossible to do it automatically. In theory we could try to decompile and reverse engineer the bytecode in an attempt to recover the tracing between the fields, but the amount of work is significant and the outcome is neither predictable nor guaranteed. So I would prefer not to go that route.
What is possible, however, is to create a few annotations or a DSL that can be used to add the missing meta information to Spline in a declarative way, right from the code. That of course requires additional effort from the job developer, and it creates a hard dependency on the Spline agent library, but it sounds like a good compromise.
The same dilemma and solution were discussed in the context of RDD lineage support, which is why I said it could be solved there.
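
As a rough idea of what the declarative approach could look like, here is a minimal plain-Java sketch. Everything in it is an assumption made for illustration: the `@DerivedFrom` annotation, the `PersonSummary` model and the `LineageHints` reader are hypothetical and are not part of the Spline agent API. The point is only that an annotation with runtime retention survives compilation, so a tool can read the field-level hints through reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical annotation, NOT part of the Spline agent API.
// RUNTIME retention keeps it available to reflection after compilation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface DerivedFrom {
    String[] value(); // names of the input attributes this field depends on
}

// Hypothetical output model; the job developer declares the lineage by hand.
class PersonSummary {
    @DerivedFrom({"firstName", "lastName"})
    String fullName;

    @DerivedFrom("age")
    int ageNextYear;
}

class LineageHints {
    // Collects output field -> input attribute names from the annotations.
    // A TreeMap is used so the result has a stable, sorted key order.
    static Map<String, List<String>> read(Class<?> outputType) {
        Map<String, List<String>> hints = new TreeMap<>();
        for (Field field : outputType.getDeclaredFields()) {
            DerivedFrom derivedFrom = field.getAnnotation(DerivedFrom.class);
            if (derivedFrom != null) {
                hints.put(field.getName(), Arrays.asList(derivedFrom.value()));
            }
        }
        return hints;
    }

    public static void main(String[] args) {
        // Prints: {ageNextYear=[age], fullName=[firstName, lastName]}
        System.out.println(read(PersonSummary.class));
    }
}
```

The trade-off is the one described above: the hints are only as accurate as the developer keeps them, and the job code ends up depending on whichever library defines the annotation.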

@carpe-erin

@wajda Ah, I see. Looking at that other feature request, while it isn't the ideal solution, I think we could make that work for us. So I am fine with closing this request for now.

@wajda
Contributor

wajda commented Nov 5, 2021

Leave it open please, for ease of tracking.

@wajda wajda removed this from the 1.1.0 milestone Feb 16, 2023