[SPARK] fix aggregate on databricks #1867

pawel-big-lebowski · 2023-05-19T06:47:26Z

Problem

Column lineage is not collected on Databricks for aggregate queries such as groupBy.

Solution

Databricks has a slightly different implementation of AggregateExpression: instead of a resultId method, it contains resultIds, which is a list.
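
A minimal sketch of how code could tolerate both shapes of the expression, assuming a reflection-based lookup. It is an illustration only, not the code in this PR: `SingleIdAggregate`, `MultiIdAggregate`, `extractResultIds`, and `invokeIfPresent` are hypothetical stand-ins, and plain `long` values replace Spark's expression ids.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ResultIdSketch {
  // Hypothetical stand-in for the variant that exposes a single resultId.
  public static class SingleIdAggregate {
    private final long id;
    public SingleIdAggregate(long id) { this.id = id; }
    public long resultId() { return id; }
  }

  // Hypothetical stand-in for the variant that exposes resultIds as a list.
  public static class MultiIdAggregate {
    private final List<Long> ids;
    public MultiIdAggregate(List<Long> ids) { this.ids = ids; }
    public List<Long> resultIds() { return ids; }
  }

  // Returns the ids from whichever accessor the object has; empty if neither exists.
  static List<Long> extractResultIds(Object expression) {
    List<Long> ids = new ArrayList<>();
    try {
      Object single = invokeIfPresent(expression, "resultId");
      if (single != null) {
        ids.add((Long) single);
        return ids;
      }
      Object many = invokeIfPresent(expression, "resultIds");
      if (many instanceof List) {
        for (Object id : (List<?>) many) {
          ids.add((Long) id);
        }
      }
    } catch (ReflectiveOperationException e) {
      // The accessor could not be called; fall through with whatever was collected.
    }
    return ids;
  }

  // Calls a public no-argument method by name, or returns null when it is absent.
  static Object invokeIfPresent(Object target, String name)
      throws ReflectiveOperationException {
    for (Method m : target.getClass().getMethods()) {
      if (m.getName().equals(name) && m.getParameterCount() == 0) {
        return m.invoke(target);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(extractResultIds(new SingleIdAggregate(7L)));
    System.out.println(extractResultIds(new MultiIdAggregate(Arrays.asList(7L, 8L))));
  }
}
```

Looking the method up by name at runtime avoids compiling against either variant, at the cost of losing compile-time checks. That trade-off is an assumption of this sketch.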

Note: All schema changes require discussion. Please link the issue for context.

Your change modifies the core OpenLineage model
Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Checklist

You've signed-off your work
Your pull request title follows our guidelines
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

codecov-commenter · 2023-05-19T07:23:31Z

Codecov Report

Merging #1867 (b64749e) into main (4be5217) will decrease coverage by 16.45%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##               main    #1867       +/-   ##
=============================================
- Coverage     81.05%   64.61%   -16.45%     
  Complexity      100      100               
=============================================
  Files            80       26       -54     
  Lines          3420      373     -3047     
  Branches         27       27               
=============================================
- Hits           2772      241     -2531     
+ Misses          617      101      -516     
  Partials         31       31

see 54 files with indirect coverage changes

boring-cyborg bot added the area:documentation and area:integration/spark labels May 19, 2023

[SPARK] fix aggregate on databricks

b64749e

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

pawel-big-lebowski force-pushed the spark/fix-databricks-group-by branch from 6722775 to b64749e Compare May 19, 2023 06:48

pawel-big-lebowski marked this pull request as ready for review May 19, 2023 06:51

pawel-big-lebowski mentioned this pull request May 19, 2023

Azure databricks use aggregate funcation can not buildColumnLineageDatasetFacet #1821

Closed

pawel-big-lebowski self-assigned this May 19, 2023

pawel-big-lebowski added the tool:databricks label May 19, 2023

mobuchowski approved these changes May 19, 2023

pawel-big-lebowski merged commit 724a0ca into main May 19, 2023
15 checks passed

pawel-big-lebowski deleted the spark/fix-databricks-group-by branch May 19, 2023 10:56
