Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attribute lineage: Dependency from aggregation expressions to preceding selection operations #791

Open
wajda opened this issue Oct 13, 2020 · 3 comments
Milestone

Comments

@wajda
Copy link
Contributor

wajda commented Oct 13, 2020

In the following pseudo query, output attribute b should depend not only on a, but also on c and g even though they are not directly present in b's AST.

SELECT AGG(a) as b, g
  FROM t
 WHERE c
 GROUP BY g
@wajda wajda added the feature label Oct 13, 2020
@wajda wajda added this to the 1.1.0 milestone Dec 11, 2020
@wajda wajda mentioned this issue Mar 4, 2021
@vidma
Copy link
Contributor

vidma commented Mar 5, 2021

@wajda I'd say might also depend on g also, as both c & g provide control lineage - the result value depend on the value of g even if indirectly. thus this we consider two types of lineages -data & control, data being the main one.

@wajda
Copy link
Contributor Author

wajda commented Mar 5, 2021

Yes, correct, of course, it's just an oversight in description. Thanks for pointing it out! :)

@wajda
Copy link
Contributor Author

wajda commented Mar 7, 2023

Just making a quick note of an idea.

This problem can be modeled in the following way. Every operation can have properties like if it potentially changes the number of rows (e.g. filter, union) or the order of rows (e.g. sort). Additionally if the operation changes either the number or the order of elements it can provide a reference to a corresponding expression or attribute involved in the process.

Then, on the expression definition level it could be specified if the expression is an aggregation, and if it's commutative.
Based on the above information we could derive additional dependencies for expressions/attributes.

Pseudo code:

trait OperationLike {
  ...
  def properties: Option[OperationProperties]
}

class OperationProperties (
  val resize: Option[Boolean | AttrOrExprRef] = false
  val reorder: Option[Boolean | AttrOrExprRef] = false
)

trait ExpressionLike {
  ...
  def isAggregation: Boolean = false
  def isCommutative: Boolean = false
}

So, considering example from the issue description, AGG could be marked as a aggregation operation that automatically makes it dependent of the number of elements in the input dataset which in turn is controlled by a filtering expression c. The WHERE clause could be defined in terms on pseudo Spline model like this:

operations: [
  {
    name: "WHERE",
    properties: {
      resize: {__exprId: "expr-111"}
    }
  }
]

attributes: [
  {
    id: "attr-a",
    name: "a"
  },
  {
    id: "attr-b",
    name: "b",
    childRefs: [{__exprId: "expr-222"}]
  },
]

expressions: [
  {
    id: "expr-111",
    name: "c"
  },
  {
    id: "expr-222",
    name: "AGG",
    isAggregation: true,
    childRefs: [{__exprId: "attr-a"}]
  },
]

As seen, the attribute b directly depends only on the attribute a through the expression expr-222. But because the expression expr-222 is an aggregation expression we look for the closest expression or attribute in the preceding sub-graph that influence the number of elements in the dataset (resize != false). That appears to be expression c (expr-111) in the WHERE operation. So we can implicitly add the expression c in the b's dependency list.

If the AGG is also marked as isCommutative: false then we would also look for the closest expression or attribute involved in the row reordering, that we could find in one of the ORDER BY operations for example.

A note for further thinking - how to model e.g. distinct operation that changes the number of elements, but preserves the dataset cardinality? How would the dependency from the MAX/MIN and SUM aggregation expressions could look like in that context?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Backlog
Development

No branches or pull requests

2 participants